Abstract


A  wide  range  of  machine  learning  problems,  including  astronomical  inference 
about galaxy clusters, natural image scene classification, parametric statistical inference,
and detection of potentially harmful sources of radiation, can be well-modeled
as  learning  a  function  on  (samples  from)  distributions.  This  thesis  explores  problems 
in  learning  such  functions  via  kernel  methods,  and  applies  the  framework  to  yield 
state-of-the-art  results  in  several  novel  settings. 

One  major  challenge  with  this  approach  is  one  of  computational  efficiency  when 
learning  from  large  numbers  of  distributions:  the  computation  of  typical  methods 
scales  between  quadratically  and  cubically,  and  so  they  are  not  amenable  to  large 
datasets.  As  a  solution,  we  investigate  approximate  embeddings  into  Euclidean 
spaces  such  that  inner  products  in  the  embedding  space  approximate  kernel  values 
between  the  source  distributions.  We  provide  a  greater  understanding  of  the  standard 
existing  tool  for  doing  so  on  Euclidean  inputs,  random  Fourier  features.  We  also 
present  a  new  embedding  for  a  class  of  information-theoretic  distribution  distances, 
and  evaluate  it  and  existing  embeddings  on  several  real-world  applications. 

The  next  challenge  is  that  the  choice  of  distance  is  important  for  getting  good 
practical  performance,  but  how  to  choose  a  good  distance  for  a  given  problem  is 
not  obvious.  We  study  this  problem  in  the  setting  of  two-sample  testing,  where 
we attempt to distinguish two distributions via the maximum mean discrepancy, and
provide  a  new  technique  for  kernel  choice  in  these  settings,  including  the  use  of 
kernels  defined  by  deep  learning-type  models. 

In  a  related  problem  setting,  common  to  physical  observations,  autonomous 
sensing,  and  electoral  polling,  we  have  the  following  challenge:  when  observing 
samples  is  expensive,  but  we  can  choose  where  we  would  like  to  do  so,  how  do  we 
pick  where  to  observe?  We  give  a  method  for  a  closely  related  problem  where  we 
search  for  instances  of  patterns  by  making  point  observations. 

Throughout,  we  combine  theoretical  results  with  extensive  empirical  evaluations 
to  increase  our  understanding  of  the  methods. 


Acknowledgements 

To  start  with,  I  want  to  thank  my  advisor,  Jeff  Schneider,  for  so  ably  guiding  me  through  this 
long  process.  When  I  began  my  PhD,  I  didn’t  have  a  particular  research  agenda  or  project  in 
mind.  After  talking  to  Jeff  a  few  times,  we  settled  on  my  joining  an  existing  project:  applying  this 
crazy  idea  of  distribution  kernels  to  computer  vision  problems.  Obviously,  the  hunch  that  this 
project  would  turn  into  something  I’d  be  interested  in  working  on  more  worked  out.  Throughout 
that  project  and  the  ones  I’ve  worked  on,  Jeff  has  been  instrumental  in  asking  the  right  questions 
to  help  me  realize  when  I’ve  gone  off  in  a  bad  direction,  in  suggesting  better  alternatives,  and 
in  thinking  pragmatically  about  problems,  using  the  best  tools  for  the  job  and  finding  the  right 
balance  between  empirical  and  theoretical  results. 

Barnabas  Poczos  also  basically  would  have  been  an  advisor  had  he  not  still  been  a  postdoc 
when  I  started  my  degree.  His  style  is  a  great  complement  to  Jeff’s,  caring  about  many  of  the 
same  things  but  coming  at  them  from  a  somewhat  different  direction.  Nina  Balcan  knew  the  right 
connections  from  the  theoretical  community,  which  I  haven’t  fully  explored  enough  yet.  I  began 
working  in  more  depth  with  Arthur  Gretton  over  the  past  six  months  or  so,  and  it’s  been  both  very 
productive  and  great  fun,  which  is  great  news  since  I’m  now  very  excited  to  move  on  to  a  postdoc 
with  him. 

Several  of  my  labmates  have  also  been  key  to  everything  I’ve  done  here.  Liang  Xiong  helped 
me  get  up  and  running  when  I  started  my  degree,  spending  hours  in  my  office  showing  me  how 
things  worked  and  how  to  make  sense  of  the  results  we  got.  Yifei  Ma’s  enthusiasm  about  widely 
varying  research  ideas  was  inspiring.  Junier  Oliva  always  knew  the  right  way  to  think  about 
something  when  I  got  stuck.  Tzu-Kuo  Huang  spent  an  entire  summer  thinking  about  distribution 
learning  with  me  and  eating  many,  many  plates  of  chicken  over  rice.  Roman  Garnett  is  a  master 
of  Gaussian  processes  and  appreciated  my  disappointment  in  Pittsburgh  pizza.  I  never  formally 
collaborated  much  with  Ben,  Maria,  Matt,  Sarny,  Sibi,  or  Xuezhi,  but  they  were  always  fun  to  talk 
about  ideas  with.  The  rest  of  the  Auton  Lab,  especially  Artur  Dubrawski,  made  brainstorming 
meetings  something  to  look  forward  to  whenever  they  happened. 

Outside  the  lab,  Michelle  Ntampaka  was  a  joy  to  collaborate  with  on  applications  to  cosmology 
problems,  even  when  she  was  too  embarrassed  to  show  me  her  code  for  the  experiments.  The 
rest  of  the  regular  Astro/Stat/ML  group  also  helped  fulfill,  or  at  least  feel  like  I  was  fulfilling, 
my  high  school  dreams  of  learning  about  the  universe.  Fish  Tung  made  crazy  things  work.  The 
xdata  crew  made  perhaps-too-frequent  drives  to  DC  and  long  summer  days  packed  in  a  small 
back  room  worth  it,  especially  frequent  Phronesis  collaborators  Ben  Johnson,  who  always  had  an 
interesting  problem  to  think  about,  and  Casey  King,  who  always  knew  an  interesting  person  to 
talk  to. 

Karen  Widmaier,  Deb  Cavlovich,  and  Catherine  Copetas  made  everything  run  smoothly: 
without  them,  not  only  would  nothing  ever  actually  happen,  but  the  things  that  did  happen  would 
be  far  less  pleasant.  Jarod  Wang  and  Predrag  Punosevac  kept  the  lab  machines  going,  despite  my 
best  efforts  to  crash,  overload,  or  otherwise  destroy  them. 

Other,  non-research  friends  also  made  this  whole  endeavor  worthwhile.  Alejandro  Carbonara 
always  had  jelly  beans,  Ameya  Velingker  made  multiple  spontaneous  trips  across  state  lines, 
Aram  Ebtekar  single-handedly  and  possibly  permanently  destroyed  my  sleep  schedule,  Dave 
Kurokawa was a good friend (there’s no time to explain why), Shayak Sen received beratement
with grace, and Sid Jain gave me lots of practice for maybe having a teenager of my own one day.
Listing friends in general is a futile effort, but here’s a few more of the Pittsburgh people without
whom  my  life  would  have  been  worse:  Alex,  Ashique,  Brendan,  Charlie,  Danny,  the  Davids, 
Goran,  Jay-Yoon,  Jenny,  Jesse,  John,  Jon,  Junxing,  Karim,  Kelvin,  Laxman,  Nic,  Nico,  Preeti, 
Richard,  Ryan,  Sarah,  Shriphani,  Vagelis,  and  Zack,  as  well  as  the  regular  groups  for  various 
board/tabletop  games,  everyone  else  who  actually  participated  in  departmental  social  events,  and 
machine  learning  conference  buddies.  Of  course,  friendship  is  less  constrained  by  geography 
than  it  once  was:  Alex  Burka,  James  Smith,  and  Tom  Eisenberg  were  subjected  to  frequent 
all-day  conversations,  the  Board  of  Shadowy  Figures  and  the  PTV  Mafia  group  provided  much 
pleasant  distraction,  and  I  spent  more  time  on  video  chats  with  Jamie  McClintock,  Luke  Collin, 
Matt  McLaughlin,  and  Tom  McClintock  than  was  probably  reasonable. 

My  parents  got  me  here  and  helped  keep  me  here,  and  even  if  my  dad  thinks  I  didn’t  want  him 
at  my  defense,  I  fully  acknowledge  that  all  of  it  is  only  because  of  them.  My  brother  Ian  and  an 
array  of  relatives  (grandparents,  aunts,  uncles,  cousins,  and  the  more  exotic  ones  like  first  cousins 
once  removed  and  great-uncles)  are  also  a  regular  source  of  joy,  whether  I  see  them  several  times 
a  year  or  a  handful  of  times  a  decade.  Thank  you. 


Contents

1 Introduction
    1.1 Summary of contributions

2 Learning on distributions
    2.1 Distances on distributions
        2.1.1 Distance frameworks
        2.1.2 Specific distributional distances
    2.2 Estimators of distributional distances
    2.3 Kernels on distributions
    2.4 Kernels on sample sets
        2.4.1 Handling indefinite kernel matrices
        2.4.2 Nystrom approximation

3 Approximate kernel embeddings via random Fourier features
    3.1 Setup
    3.2 Reconstruction variance
    3.3 Convergence bounds
        3.3.1 L2 bound
        3.3.2 High-probability uniform bound
        3.3.3 Expected max error
        3.3.4 Concentration about the mean
        3.3.5 Other bounds
    3.4 Downstream error
        3.4.1 Kernel ridge regression
        3.4.2 Support vector machines
    3.5 Numerical evaluation on an interval

4 Scalable distribution learning with approximate kernel embeddings
    4.1 Mean map kernels
        4.1.1 Convergence bounds
    4.2 L2 distances
        4.2.1 Connection to mmd embedding
    4.3 Information-theoretic distances
        4.3.1 Convergence bound
        4.3.2 Generalization to α-HDDs
        4.3.3 Connection to mmd

5 Applications of distribution learning
    5.1 Dark matter halo mass prediction
    5.2 Mixture estimation
    5.3 Scene recognition
        5.3.1 sift features
        5.3.2 Deep features
    5.4 Small-sensor detection of radiation sources

6 Choosing kernels for hypothesis tests
    6.1 Estimators of mmd
    6.2 Estimators of the variance of mmd²
    6.3 mmd kernel choice criteria
        6.3.1 Median heuristic
        6.3.2 Marginal likelihood maximization
        6.3.3 Maximizing mmd
        6.3.4 Cross-validation of loss
        6.3.5 Cross-validation of power
        6.3.6 Embedding-based Hotelling statistic
        6.3.7 Streaming t-statistic
        6.3.8 Pairwise t-statistic
    6.4 Experiments
        6.4.1 Same Gaussian
        6.4.2 Gaussian variance difference
        6.4.3 Blobs

7 Active search for patterns
    7.1 Related work
    7.2 Problem formulation
    7.3 Method
        7.3.1 Analytic expected utility for functional probit models
        7.3.2 Analysis for independent regions
    7.4 Empirical evaluation
        7.4.1 Environmental monitoring (linear classifier)
        7.4.2 Predicting election results (linear classifier)
        7.4.3 Finding vortices (black-box classifier)

8 Conclusions and future directions
    8.1 Deep learning of kernels for two-sample testing
    8.2 Deep learning of kernels for distribution learning
        8.2.1 Integration with deep computer vision models
        8.2.2 Other parameterizations for kernel learning
    8.3 Word and document embeddings as distributions
    8.4 Active learning on distributions

A The skl-groups package

B Proofs for Chapter 3
    B.1 Proof of Proposition 3.4
    B.2 Proof of Proposition 3.5
    B.3 Proof of Proposition 3.6
        B.3.1 Regularity Condition
        B.3.2 Lipschitz Constant
        B.3.3 Anchor Points
        B.3.4 Optimizing Over r
    B.4 Proof of Proposition 3.7
        B.4.1 Regularity Condition
        B.4.2 Lipschitz Constant
        B.4.3 Anchor Points
        B.4.4 Optimizing Over r
    B.5 Proof of Proposition 3.8
    B.6 Proof of Proposition 3.9

C Proofs for Chapter 4
    C.1 Proof of Proposition 4.10

Bibliography



Chapter  1 

Introduction 


Traditional  machine  learning  approaches  focus  on  learning  problems  defined  on  vectors,  mapping 
whatever  kind  of  object  we  wish  to  model  to  a  fixed  number  of  real-valued  attributes.  Though 
this  approach  has  been  very  successful  in  a  variety  of  application  areas,  choosing  natural  and 
effective  representations  can  be  quite  difficult. 

In  many  settings,  we  wish  to  perform  machine  learning  tasks  on  objects  that  can  be  viewed  as 
a  collection  of  lower-level  objects  or  more  directly  as  samples  from  a  distribution.  For  example: 

•  Images  can  be  thought  of  as  a  collection  of  local  patches  (Section  5.3);  similarly,  videos 
are  collections  of  frames. 

•  The  total  mass  of  a  galaxy  cluster  can  be  predicted  based  on  the  positions  and  velocities  of 
individual  galaxies  (Section  5.1). 

• The photons received by a small radiation sensor can be used to classify the presence of
harmful  radioactive  material  (Section  5.4). 

•  Support  for  a  political  candidate  among  various  demographic  groups  can  be  estimated  by 
learning  a  regression  model  from  electoral  districts  of  individual  voters  to  district-level 
support  for  political  candidates  (Flaxman,  Y.-X.  Wang,  et  al.  2015). 

• Documents are made of sentences, which are themselves composed of words, which themselves
can be seen as being represented by sets of the contexts in which they appear
(Section 8.3).

• Parametric statistical inference problems learn a function from sample sets to model
parameters (Section 5.2).

• Expectation propagation techniques rely on maps from sample sets to messages normally
computed via expensive numerical integration (Jitkrittum, Gretton, et al. 2015).

•  Causal  arrows  between  distributions  can  be  estimated  from  samples  (Lopez-Paz  et  al.  2015). 

In order to use traditional techniques on these collective objects, we must create a single
vector that represents the entire set. Though there are various ways to summarize a set as a vector,
we can often discard less information and require less effort in feature engineering by operating
directly on sets of feature vectors.

One method for machine learning on sets is to consider them as samples from some unknown
underlying probability distribution over feature vectors. Each example then has its own
distribution: if we are classifying images as sets of patches, each image is defined as a distribution over
patch features, and each class of images is a set of patch-level feature distributions. We can then
define a kernel based on statistical estimates of a distance between probability distributions.
Letting $\mathcal{X} \subseteq \mathbb{R}^d$ denote the set of possible feature vectors, we thus define a kernel
$k : 2^{\mathcal{X}} \times 2^{\mathcal{X}} \to \mathbb{R}$.
This lets us perform classification, regression, anomaly detection, clustering, low-dimensional
embedding, and many other tasks with the well-developed suite of kernel methods.
Chapter 2 discusses various such kernels and their estimators; Chapter 5 gives empirical results
on several problems.

When used for a learning problem with $N$ training items, however, typical kernel methods
require operating on an $N \times N$ kernel matrix, which requires far too much computation to scale to
datasets with a large number of instances. One way to avoid this problem is through approximate
embeddings $z : \mathcal{X} \to \mathbb{R}^D$, à la Rahimi and Recht (2007), such that $z(x)^\top z(y) \approx k(x, y)$. Chapter 3
gives some new results in the theory of random Fourier embeddings, while Chapter 4 uses them
as a tool in developing embeddings for several distributional kernels, which are also evaluated
empirically in Chapter 5.
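
As a rough illustration of the approximate-embedding idea, the following sketch builds random Fourier features for a Gaussian kernel with NumPy. The function name `rff_embedding`, the bandwidth, and the number of features are illustrative assumptions, not choices fixed by the text; the paired cosine/sine construction is one common variant of the features.

```python
import numpy as np

def rff_embedding(X, n_features, bandwidth, rng):
    """Map rows of X to z(X) such that z(x) @ z(y) approximates
    exp(-||x - y||^2 / (2 * bandwidth^2))."""
    d = X.shape[1]
    # Frequencies are drawn from the Gaussian kernel's spectral density.
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features // 2))
    proj = X @ W
    # Paired cos/sin features; this scaling gives z(x) @ z(x) = 1 exactly.
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(n_features // 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Z = rff_embedding(X, n_features=4000, bandwidth=1.5, rng=rng)

# Compare against the exact kernel matrix on the same points.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq_dists / (2 * 1.5 ** 2))
max_abs_error = np.abs(Z @ Z.T - K_exact).max()
```

Once `Z` is available, any linear method applied to its rows approximates the corresponding kernel method, at a cost that grows linearly rather than quadratically in the number of points.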

Chapter  6  moves  to  the  related  problem  of  two-sample  testing.  Here,  we  are  given  two  sample 
sets  X  and  Y,  and  we  wish  to  test  the  hypothesis  that  X  and  Y  were  generated  from  the  same 
distribution.  This  problem,  closely  related  to  classification,  has  many  practical  applications; 
one  primary  method  for  doing  so  is  based  on  the  maximum  mean  discrepancy  (mmd)  between 
the  distributions.  This  method  relies  on  a  base  kernel;  Chapter  6  develops  and  evaluates  a  new 
method  for  selecting  these  kernels,  including  complex  kernels  based  on  deep  learning. 

Chapter  7  addresses  the  application  of  this  type  of  complex  functional  classifier  to  an  active 
search problem. Consider finding polluted areas in a body of water, based on point measurements.
Given an observation budget, we wish to adaptively choose where to make these
observations so as to maximize the number of regions we can be confident are polluted. If
our  notion  of  “pollution”  is  defined  simply  by  a  threshold  on  the  mean  value  of  a  univariate 
measurement,  Y.  Ma,  Garnett,  et  al.  (2014)  give  a  natural  selection  algorithm  based  on  Gaussian 
process  inference.  If,  instead,  our  sensors  measure  the  concentrations  of  several  chemicals,  the 
vector  flow  of  water  current,  or  other  such  more  complicated  data,  we  can  instead  apply  a  classifier 
to  a  region  and  consider  the  problem  of  finding  regions  that  the  classifier  marks  as  relevant. 

1.1  Summary  of  contributions 

• Chapter 2 mostly establishes the framework with which we will discuss learning on
distributions. Section 2.4.2 includes a mildly novel analysis not yet published.¹

• Chapter 3 improves the theoretical understanding of the random Fourier features of Rahimi
and Recht (2007). (Based on Sutherland and Schneider 2015.)

• Section 4.3 gives an approximate embedding for a new class of distributional distances.
(Based on Sutherland, J. B. Oliva, et al. 2016.)

• Chapter 5 provides empirical studies for the application of distributional distances to
practical problems. (Based on Poczos, Xiong, Sutherland, et al. 2012; Sutherland, Xiong,
et al. 2012; Ntampaka, Trac, Sutherland, Battaglia, et al. 2015; Jin 2016; Jin et al. 2016;
Sutherland, J. B. Oliva, et al. 2016; Ntampaka, Trac, Sutherland, Fromenteau, et al. in
press.)

• Chapter 6 develops and evaluates a new method for kernel selection in two-sample testing
based on the mmd distributional distance. (Work not yet published.²)

• Chapter 7 presents and analyzes a method for the novel problem setting of active pointillistic
pattern search, using point observations to observe regional patterns. (Based on Y. Ma,
Sutherland, et al. 2015.)

• The skl-groups package, overviewed in Appendix A, provides efficient implementations
of several of the methods for learning on distributions discussed in this thesis.

¹ This was developed with Tzu-Kuo (TK) Huang.
² Done in collaboration with Fish Tung, Aaditya Ramdas, Heiko Strathmann, Alex Smola, and Arthur Gretton.
Chapter 2

Learning  on  distributions 


As  discussed  in  Chapter  1,  we  consider  the  problem  of  learning  on  probability  distributions. 
Specifically: let $\mathcal{X} \subseteq \mathbb{R}^d$ be the set of observable feature vectors, $\mathcal{S}$ the set of possible sample
sets (all finite subsets of $\mathcal{X}$), and $\mathcal{P}$ the set of probability distributions under consideration. We
then perform machine learning on samples from distributions by:

1. Choosing a distance on distributions $\rho : \mathcal{P} \times \mathcal{P} \to \mathbb{R}$.

2. Defining a Mercer kernel $k : \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ based on $\rho$.

3. Estimating $k$ based on the observed samples as $\hat{k} : \mathcal{S} \times \mathcal{S} \to \mathbb{R}$, which should itself be a
kernel on $\mathcal{S}$.

4. Using $\hat{k}$ in a standard kernel method, such as an svm or a Gaussian process, to perform
classification, regression, collective anomaly detection, or other machine learning tasks.
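
The four steps above can be sketched end to end on a toy problem. Everything concrete here is an assumption made for illustration: the data (one-dimensional Gaussian sample sets labeled by their means), the plug-in $L_2$ distance computed from histogram density estimates, the median-based bandwidth, and the ridge parameter. It is not meant to reproduce the estimators studied later in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: each example is a sample set from N(mu_i, 1); the label is mu_i.
mus = rng.uniform(-2, 2, size=60)
sample_sets = [rng.normal(mu, 1.0, size=200) for mu in mus]

# Steps 1 and 3: a plug-in estimate of the squared L2 distance between densities,
# using histogram density estimates on a shared grid (an illustrative choice).
edges = np.linspace(-6, 6, 41)
width = edges[1] - edges[0]
dens = np.array([np.histogram(s, bins=edges, density=True)[0] for s in sample_sets])
sq_dist = ((dens[:, None, :] - dens[None, :, :]) ** 2).sum(-1) * width

# Step 2: a Gaussian-type kernel built on top of the distance.
sigma = np.sqrt(np.median(sq_dist[sq_dist > 0]))
K = np.exp(-sq_dist / (2 * sigma ** 2))

# Step 4: kernel ridge regression, training on the first 40 sample sets.
train, test = np.arange(40), np.arange(40, 60)
alpha = np.linalg.solve(K[np.ix_(train, train)] + 1e-3 * np.eye(40), mus[train])
pred = K[np.ix_(test, train)] @ alpha
rmse = np.sqrt(np.mean((pred - mus[test]) ** 2))
```

Because the estimated distance here is a Euclidean distance between histogram vectors, the resulting `K` is positive semidefinite by construction; for other distance estimators this is not guaranteed.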

Certainly,  this  is  not  the  only  approach  to  learning  on  distributions.  Some  distributional 
learning  methods  do  not  directly  compare  sample  sets  to  one  another,  but  rather  compare  their 
elements  to  a  class-level  distribution  (Boiman  et  al.  2008).  Given  a  distance  p,  one  can  naturally 
use  k-nearest  neighbor  models  (Poczos,  Xiong,  and  Schneider  2011;  Kusner  et  al.  2015),  or 
Nadaraya-Watson-type local regression models (J. B. Oliva, Poczos, et al. 2013; Poczos, Rinaldo,
et al. 2013) with respect to that distance. In this thesis, however, we focus on kernel methods as a
well-studied, flexible, and empirically effective approach to a broad variety of learning problems.

We  typically  assume  that  every  distribution  in  P  has  a  density  with  respect  to  the  Lebesgue 
measure, and slightly abuse notation by using distributions $P$, $Q$ and their densities $p$, $q$
interchangeably.


2.1  Distances  on  distributions 

We will define kernels on distributions by first defining distances $\rho$ between them.

2.1.1  Distance  frameworks 

We first present four general frameworks for distances on distributions. These are each broad
categories of distances containing (or related to) several of the concrete distance families we
employ.


$L_r$ metrics One natural way to compute distances between distributions is the $L_r$ metric between
their densities, for order $r \ge 1$:

$$ L_r(p, q) := \left( \int_{\mathcal{X}} |p(x) - q(x)|^r \, dx \right)^{1/r}. $$

Note that the limit $r \to \infty$ yields the distance $L_\infty(p, q) = \sup_{x \in \mathcal{X}} |p(x) - q(x)|$.
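
A small numerical sketch of the $L_r$ distances, assuming densities tabulated on a fine grid and a simple Riemann-sum approximation of the integral. The helper names and the choice of two unit-variance Gaussians are illustrative.

```python
import numpy as np

def lr_distance(p, q, dx, r):
    """Riemann-sum approximation of the L_r distance between densities
    tabulated as arrays p and q on a grid with spacing dx."""
    if np.isinf(r):
        return np.abs(p - q).max()
    return (np.sum(np.abs(p - q) ** r) * dx) ** (1.0 / r)

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
p, q = gaussian_pdf(x, 0.0, 1.0), gaussian_pdf(x, 1.0, 1.0)

l1 = lr_distance(p, q, dx, 1)
l2 = lr_distance(p, q, dx, 2)
l_inf = lr_distance(p, q, dx, np.inf)
l_large = lr_distance(p, q, dx, 200)  # a large finite order approaches l_inf
```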

$f$-divergences For any convex function $f$ with $f(1) = 0$, the $f$-divergence of $P$ to $Q$ is

$$ D_f(P \| Q) := \int_{\mathcal{X}} f\!\left( \frac{p(x)}{q(x)} \right) q(x) \, dx. $$

This class is sometimes called “Csiszar $f$-divergences”, after Csiszar (1963). Sometimes the
requirement of convexity or that $f(1) = 0$ is dropped. Note that these functions are not in general
symmetric, nor do they in general satisfy the triangle inequality. They do, however, satisfy
$D_f(P \| P) = 0$ and $D_f(P \| Q) \ge 0$ (with equality only for $P = Q$ when $f$ is strictly convex at 1),
and are jointly convex:

$$ D_f(\lambda P + (1 - \lambda) P' \,\|\, \lambda Q + (1 - \lambda) Q') \le \lambda D_f(P \| Q) + (1 - \lambda) D_f(P' \| Q'). $$

In fact, the only metric $f$-divergences are multiples of the total variation distance, discussed
shortly (Khosravifard et al. 2007), though e.g. the squared Hellinger distance is an $f$-divergence that
is the square of a metric. For an overview, see e.g. Liese and Vajda (2006).
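
For discrete distributions the $f$-divergence reduces to a finite sum, which makes the properties above easy to check numerically. The sketch below assumes strictly positive probability vectors and uses two standard generators, $t \mapsto t \log t$ (which gives the Kullback-Leibler divergence) and $t \mapsto \frac{1}{2} |t - 1|$ (which gives total variation); the function name `f_divergence` and the example vectors are illustrative.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x f(p(x) / q(x)) q(x) for discrete P, Q with q > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(f(p / q) * q)

kl_gen = lambda t: t * np.log(t)       # valid for strictly positive t
tv_gen = lambda t: 0.5 * np.abs(t - 1)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])
p2 = np.array([0.1, 0.1, 0.8])
q2 = np.array([0.3, 0.4, 0.3])

d_same = f_divergence(p, p, kl_gen)    # zero, since f(1) = 0
d_pq = f_divergence(p, q, kl_gen)
d_qp = f_divergence(q, p, kl_gen)      # differs from d_pq: no symmetry in general
tv_pq = f_divergence(p, q, tv_gen)     # equals half the L1 distance

# Joint convexity, checked at one mixing weight.
lam = 0.3
lhs = f_divergence(lam * p + (1 - lam) * p2, lam * q + (1 - lam) * q2, kl_gen)
rhs = lam * f_divergence(p, q, kl_gen) + (1 - lam) * f_divergence(p2, q2, kl_gen)
```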

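$\alpha$-$\beta$ divergences The following somewhat less-standard divergence family, defined e.g. by
Poczos, Xiong, Sutherland, et al. (2012) generalizing the $\alpha$-divergence of Amari (1985), is also
useful. Given two real parameters $\alpha$, $\beta$, $D_{\alpha,\beta}$ is defined as

$$ D_{\alpha,\beta}(P \| Q) := \int_{\mathcal{X}} p(x)^{\alpha} \, q(x)^{\beta} \, p(x) \, dx. $$

$D_{\alpha,\beta}(P \| Q) \ge 0$ for any $\alpha$, $\beta$; $D_{\alpha,-\alpha}(P \| P) = 1$. Note also that $D_{\alpha,-\alpha}$ has the form of an
$f$-divergence with $t \mapsto t^{\alpha+1}$, though this does not satisfy $f(1) = 0$ and is convex only if $\alpha \notin (-1, 0)$.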


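Integral probability metrics Many useful metrics can be expressed as integral probability
metrics (ipms, Muller 1997):

$$ \rho_{\mathcal{F}}(P, Q) := \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{X \sim P} f(X) - \mathbb{E}_{Y \sim Q} f(Y) \right|, $$

where $\mathcal{F}$ is some family of functions $f : \mathcal{X} \to \mathbb{R}$. Note that $\rho_{\mathcal{F}}$ satisfies $\rho_{\mathcal{F}}(P, P) = 0$,
$\rho_{\mathcal{F}}(P, Q) = \rho_{\mathcal{F}}(Q, P)$, and $\rho_{\mathcal{F}}(P, Q) \le \rho_{\mathcal{F}}(P, R) + \rho_{\mathcal{F}}(R, Q)$ for any $\mathcal{F}$, and is thus always a
pseudometric; the remaining metric property of distinguishability, $(\rho_{\mathcal{F}}(P, Q) = 0) \Rightarrow (P = Q)$,
depends on $\mathcal{F}$. Sriperumbudur et al. (2009) give an overview.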
2.1.2 Specific distributional distances


The  various  distributional  distances  below  can  often  be  represented  in  one  or  more  of  these 
frameworks.  Many  more  such  distances  exist;  here  we  mainly  discuss  the  ones  used  in  this  thesis, 
along  with  a  few  others  of  interest.  Figure  2.1  gives  a  visual  illustration  of  several  of  the  distances 
considered  here. 

$L_2$ distance The $L_2$ distance is one of the most common metrics used on distributions. It can
also be represented in terms of $\alpha$-$\beta$ divergences, as $L_2(P, Q)^2 = D_{1,0} - 2 D_{0,1} + D_{-1,2}$.
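
The representation of the $L_2$ distance through $\alpha$-$\beta$ divergences can be checked numerically. This sketch assumes the definition $D_{\alpha,\beta}(P \| Q) = \int p(x)^{\alpha} q(x)^{\beta} p(x) \, dx$ (the form used by Poczos, Xiong, Sutherland, et al. 2012), approximated by a Riemann sum on a grid; under that definition the three terms combine into the squared $L_2$ distance. The helper names and the two Gaussian densities are illustrative.

```python
import numpy as np

def d_alpha_beta(p, q, alpha, beta, dx):
    """Riemann-sum approximation of int p(x)^alpha q(x)^beta p(x) dx
    for densities tabulated on a grid with spacing dx."""
    return np.sum(p ** alpha * q ** beta * p) * dx

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

x = np.linspace(-8, 8, 16001)
dx = x[1] - x[0]
p, q = gaussian_pdf(x, 0.0, 1.0), gaussian_pdf(x, 1.0, 1.5)

# Squared L2 distance, directly and via D_{1,0} - 2 D_{0,1} + D_{-1,2}.
l2_sq_direct = np.sum((p - q) ** 2) * dx
l2_sq_via_d = (d_alpha_beta(p, q, 1, 0, dx)
               - 2 * d_alpha_beta(p, q, 0, 1, dx)
               + d_alpha_beta(p, q, -1, 2, dx))

# D_{a,-a}(P || P) integrates p itself, so it is 1 up to grid truncation.
self_value = d_alpha_beta(p, p, 0.7, -0.7, dx)
```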

Total variation distance The total variation distance (tv) is such an important distance that it
is sometimes referred to simply as “the statistical distance.” It can be defined as

$$ \mathrm{tv}(P, Q) = \sup_{A} |P(A) - Q(A)|, $$

where $A$ ranges over every event in the underlying $\sigma$-algebra. It can also be represented as
$\frac{1}{2} L_1(P, Q)$, as an $f$-divergence with $t \mapsto \frac{1}{2} |t - 1|$, and as an ipm with (among other classes)
$\mathcal{F} = \{ f : \sup_{x \in \mathcal{X}} f(x) - \inf_{x \in \mathcal{X}} f(x) \le 1 \}$ (Muller 1997). Note that tv is a metric, and
$0 \le \mathrm{tv}(P, Q) \le 1$.

The total variation distance is closely related to the “intersection distance”, most commonly
used on histograms (Cha and Srihari 2002):

$$ 1 - \int_{\mathcal{X}} \min(p(x), q(x)) \, dx = \mathrm{tv}(P, Q). $$

Kullback-Leibler divergence The Kullback-Leibler (kl) divergence is defined as

$$ \mathrm{kl}(P \| Q) := \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \, dx. $$

For discrete distributions, the kl divergence bears a natural information-theoretic interpretation
as the expected excess code length required to send a message for $P$ via the optimal code for $Q$. It
is nonnegative, and zero iff $P = Q$ almost everywhere; however, $\mathrm{kl}(P \| Q) \ne \mathrm{kl}(Q \| P)$ in general.
Note also that if there is any point with $p(x) > 0$ and $q(x) = 0$, $\mathrm{kl}(P \| Q) = \infty$.

Applications often use a symmetrization by averaging with the dual:

$$ \mathrm{skl}(P, Q) := \tfrac{1}{2} \left( \mathrm{kl}(P \| Q) + \mathrm{kl}(Q \| P) \right). $$

This is also sometimes called Jeffrey’s divergence, though that name is also sometimes used to
refer to the Jensen-Shannon divergence (below), so we avoid it. skl does not satisfy the triangle
inequality.

kl can be viewed as an $f$-divergence, with one direction corresponding to $t \mapsto t \log t$ and the
other to $t \mapsto -\log t$; skl is thus an $f$-divergence with $t \mapsto \frac{1}{2} (t - 1) \log t$.
Jensen-Shannon divergence The Jensen-Shannon divergence (js) is based on kl:

$$ \mathrm{js}(P, Q) := \tfrac{1}{2} \, \mathrm{kl}\!\left( P \,\Big\|\, \frac{P + Q}{2} \right) + \tfrac{1}{2} \, \mathrm{kl}\!\left( Q \,\Big\|\, \frac{P + Q}{2} \right), $$

where $\frac{P + Q}{2}$ denotes an equal mixture between $P$ and $Q$. js is clearly symmetric, and in fact $\sqrt{\mathrm{js}}$
satisfies the triangle inequality. Note also that $0 \le \mathrm{js}(P, Q) \le \log 2$. It gets its name from the fact
that it can be written as the Jensen difference of the Shannon entropy:

$$ \mathrm{js}(P, Q) = H\!\left[ \frac{P + Q}{2} \right] - \frac{H[P] + H[Q]}{2}, $$

a view which allows a natural generalization to more than two distributions. Non-equal mixtures
are also natural, but of course asymmetric. For more details, see e.g. Martins et al. (2009).
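
The following sketch, for discrete distributions, illustrates several of the properties stated for kl and js: asymmetry of kl, its divergence to infinity when supports do not match, and the agreement between the two expressions for js. Function names and example vectors are illustrative assumptions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    p = np.asarray(p, float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def kl(p, q):
    """KL(P || Q) for discrete distributions; infinite if q = 0 where p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    if np.any(q[nz] == 0):
        return np.inf
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def js(p, q):
    """Jensen-Shannon divergence as an average of KLs to the equal mixture."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_entropy(p, q):
    """The same quantity, written as a Jensen difference of Shannon entropies."""
    m = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return entropy(m) - 0.5 * (entropy(p) + entropy(q))

a, b = np.array([0.5, 0.3, 0.2]), np.array([0.1, 0.6, 0.3])
kl_ab, kl_ba = kl(a, b), kl(b, a)      # unequal in general

p, q = np.array([0.5, 0.5, 0.0]), np.array([0.0, 0.5, 0.5])
kl_mismatch = kl(p, q)                 # infinite: p > 0 where q = 0
js_pq, js_pq_entropy = js(p, q), js_entropy(p, q)
js_disjoint = js(np.array([1.0, 0.0]), np.array([0.0, 1.0]))  # attains log 2
```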


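Renyi-$\alpha$ divergence The Renyi-$\alpha$ divergence (Renyi 1961) generalizes kl as

$$ \mathrm{r}_\alpha(P \| Q) := \frac{1}{\alpha - 1} \log \int p(x)^{\alpha} q(x)^{1 - \alpha} \, dx; $$

note that $\lim_{\alpha \to 1} \mathrm{r}_\alpha(P \| Q) = \mathrm{kl}(P \| Q)$, though $\alpha = 1$ is not defined. $\mathrm{r}_\alpha$ is typically used for
$\alpha \in (0, 1) \cup (1, \infty)$; for $\alpha < 0$, it can be negative. Like kl, $\mathrm{r}_\alpha$ is asymmetric; we similarly define
a symmetrization

$$ \mathrm{sr}_\alpha(P, Q) := \tfrac{1}{2} \left( \mathrm{r}_\alpha(P \| Q) + \mathrm{r}_\alpha(Q \| P) \right). $$

$\mathrm{sr}_\alpha$ does not satisfy the triangle inequality.

$\mathrm{r}_\alpha$ can be represented based on an $\alpha$-$\beta$ divergence: $\mathrm{r}_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log D_{\alpha-1,1-\alpha}(P \| Q)$.

A Jensen-Renyi divergence, defined by replacing kl with $\mathrm{r}_\alpha$ in the definition of js, has also
been studied (Martins et al. 2009), but we will not consider it here.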


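Tsallis-$\alpha$ divergence The Tsallis-$\alpha$ divergence, named after Tsallis (1988) but previously
studied by Havrda and Charvat (1967) and Daroczy (1970), provides a different generalization of
kl:

$$ \mathrm{t}_\alpha(P \| Q) := \frac{1}{\alpha - 1} \left( \int p(x)^{\alpha} q(x)^{1 - \alpha} \, dx - 1 \right). $$

Again, $\lim_{\alpha \to 1} \mathrm{t}_\alpha(P \| Q) = \mathrm{kl}(P \| Q)$, and of course $\mathrm{t}_\alpha(P \| Q) = \frac{1}{\alpha - 1} \left( D_{\alpha-1,1-\alpha}(P \| Q) - 1 \right)$. Because of
its close relation to $\mathrm{r}_\alpha$, we will not use it further.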
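
A quick numerical check of the limiting behaviour of the Renyi and Tsallis divergences on discrete distributions. It assumes the forms $\frac{1}{\alpha - 1} \log \sum_x p(x)^{\alpha} q(x)^{1-\alpha}$ and $\frac{1}{\alpha - 1} \left( \sum_x p(x)^{\alpha} q(x)^{1-\alpha} - 1 \right)$; the function names and example vectors are illustrative.

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def renyi(p, q, alpha):
    return np.log(np.sum(p ** alpha * q ** (1 - alpha))) / (alpha - 1)

def tsallis(p, q, alpha):
    return (np.sum(p ** alpha * q ** (1 - alpha)) - 1) / (alpha - 1)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])

# Both families approach KL as alpha tends to 1.
near_one = 1 + 1e-6
gap_renyi = abs(renyi(p, q, near_one) - kl(p, q))
gap_tsallis = abs(tsallis(p, q, near_one) - kl(p, q))

# For alpha < 0 the Renyi divergence can be negative.
negative_example = renyi(p, q, -1.0)

# The two divergences are monotone transforms of the same integral.
alpha = 0.5
tsallis_from_renyi = (np.exp((alpha - 1) * renyi(p, q, alpha)) - 1) / (alpha - 1)
```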


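Hellinger distance The square of the Hellinger distance h is defined as

$$ \mathrm{h}^2(P, Q) := \frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx = 1 - \int \sqrt{p(x) \, q(x)} \, dx. $$

$\mathrm{h}^2$ can be expressed as an $f$-divergence with either $t \mapsto \frac{1}{2} (\sqrt{t} - 1)^2$ or $t \mapsto 1 - \sqrt{t}$; it is also
closely related to an $\alpha$-$\beta$ divergence as $\mathrm{h}^2(P, Q) = 1 - D_{-1/2,1/2}(P \| Q)$. h is a metric, and is bounded
in $[0, 1]$. It is proportional to the $L_2$ distance between $\sqrt{p}$ and $\sqrt{q}$, which yields the bounds
$\mathrm{h}^2(P, Q) \le \mathrm{tv}(P, Q) \le \sqrt{2} \, \mathrm{h}(P, Q)$.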
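
The bounds relating the Hellinger and total variation distances can be spot-checked on random discrete distributions. This is only an empirical sanity check under illustrative choices (Dirichlet-sampled probability vectors of length five), not a proof.

```python
import numpy as np

def hellinger_sq(p, q):
    """Squared Hellinger distance between discrete distributions."""
    return 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)

def total_variation(p, q):
    return 0.5 * np.sum(np.abs(p - q))

rng = np.random.default_rng(0)
pairs = [(rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))) for _ in range(1000)]

# h^2 <= tv <= sqrt(2) * h, with a small tolerance for rounding.
bounds_hold = all(
    hellinger_sq(p, q) <= total_variation(p, q) + 1e-12
    and total_variation(p, q) <= np.sqrt(2 * hellinger_sq(p, q)) + 1e-12
    for p, q in pairs
)

# The two expressions for h^2 agree.
p0, q0 = pairs[0]
affinity_form = 1 - np.sum(np.sqrt(p0 * q0))
```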
$\chi^2$ divergence There are many distinct definitions of the $\chi^2$ divergence. We do not directly use
any in this thesis, but the most common versions are:

$$ \chi^2_P(P \| Q) := \int \left( \frac{p(x)}{q(x)} - 1 \right)^2 q(x) \, dx = \int \frac{(p(x) - q(x))^2}{q(x)} \, dx = \int \frac{p(x)^2}{q(x)} \, dx - 1 $$

$$ \chi^2_N(P \| Q) := \int \frac{(p(x) - q(x))^2}{p(x)} \, dx = \int \frac{q(x)^2}{p(x)} \, dx - 1 $$

$$ \chi^2_S(P, Q) := \frac{1}{2} \int \frac{(p(x) - q(x))^2}{p(x) + q(x)} \, dx $$

$$ \chi^2_A(P, Q) := 2 \left( 1 - \int \frac{2 \, p(x) \, q(x)}{p(x) + q(x)} \, dx \right) $$

$\chi^2_P(P \| Q)$ is an $f$-divergence using either $t \mapsto (t - 1)^2$ or $t \mapsto t^2 - 1$, used e.g. by Liese and Vajda
(2006); it is sometimes called the Pearson divergence or similar, and is often used in hypothesis
testing of multinomial data. $\chi^2_N(P \| Q)$, termed the Neyman divergence e.g. by Cressie and Read
(1984), is its dual: $\chi^2_N(P \| Q) = \chi^2_P(Q \| P)$. Neither is commonly used in learning on distributions.

$\chi^2_S(P, Q)$ is a symmetric variant of these distances; its use on discrete distributions, especially
histograms, is common in computer vision (Puzicha et al. 1997; Zhang et al. 2006).

Vedaldi and Zisserman (2012) use the kernel $k_{\chi^2}(P, Q) := 2 \int \frac{p(x) \, q(x)}{p(x) + q(x)} \, dx$, sometimes called
the additive $\chi^2$ kernel (e.g. by Grisel et al. 2016), which corresponds to the distance $\chi^2_A$. Despite
a claim to the contrary by Vedaldi and Zisserman (2012), it is not equal to $\chi^2_S$.

Earth  mover’s  distance  The  earth  mover’s  distance  (emdp)  is  defined  for  a  metric  p  as 

emd  p{P,Q):=  inf  E{XJhR[p(X,Y)\,  (2.1) 

where  T(P,  Q)  is  the  set  of  joint  distributions  with  marginals  P  and  Q.  It  is  also  called  the  first 
Wasserstein  distance,  or  the  Mallows  distance.  When  (V,p)  is  separable  (in  the  topological 
sense),  it  is  also  equal  to  the  Kantorovich  metric,  which  is  the  ipm  with  g  =  {/  :  \\/\\l  <  1 }, 
where  ||/||l  :=  sup  {|  f(x)  -  f(y)\/p(x,y )  |  jc  4=  y  e  V}  is  the  Lipschitz  semi-norm.  Edwards 
(201 1)  gives  some  historical  details  and  proves  the  equality  in  a  more  general  setting. 

For  discrete  distributions,  emd  can  be  computed  via  linear  programming,  and  is  popular  in 
the  computer  vision  community  (e.g.  Rubner  et  al.  2000;  Zhang  et  al.  2006). 

Cuturi (2013) proposes a distance called the Sinkhorn distance, which replaces $\Gamma(P, Q)$ in
(2.1) with a constraint that the kl divergence of the distribution from the independent be less
than some parameter $\alpha$. This both allows for much faster computation of the distance on discrete
distributions and, in certain problems, yields learning models that outperform those based on the
full emd.

Maximum mean discrepancy  The maximum mean discrepancy, called the mmd (Sriperumbudur,
Gretton, et al. 2010; Gretton, Borgwardt, et al. 2012), is defined by embedding distributions
into a reproducing kernel Hilbert space (rkhs; for a detailed overview see Berlinet and
Thomas-Agnan 2004). Let $k$ be the kernel associated with some rkhs $\mathcal{H}$ with feature map $\varphi : \mathcal{X} \to \mathcal{H}$,
also denoted $\varphi(x) = k(x, \cdot)$, such that $\langle \varphi(x), \varphi(y) \rangle_{\mathcal{H}} = k(x, y)$. We can then map a distribution $P$
to its mean embedding $\mu_{\mathcal{H}}(P) := \mathbb{E}_{X \sim P} [\varphi(X)]$, and define the distance between distributions as
the distance between their mean embeddings:

\[
\mathrm{mmd}_k(P, Q) := \| \mu_{\mathcal{H}}(P) - \mu_{\mathcal{H}}(Q) \|_{\mathcal{H}}.
\]

$\mathrm{mmd}_k$ can also be viewed as an ipm with $\mathcal{G} = \{ f \in \mathcal{H} \mid \|f\|_{\mathcal{H}} \le 1 \}$, where $\|f\|_{\mathcal{H}}$ is the
norm in $\mathcal{H}$. (If $f \in \mathcal{H}$, $f(\cdot) = \sum_i \alpha_i k(x_i, \cdot)$ for some points $x_i \in \mathcal{X}$ and weights $\alpha_i \in \mathbb{R}$;
$\|f\|_{\mathcal{H}}^2 = \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)$.) In fact, the function $f$ achieving the supremum is known as the
witness function, and is achieved (up to normalization) by $f = \mu_P - \mu_Q$.

The mean embedding always exists when the base kernel $k$ is bounded, in which case $\mathrm{mmd}_k$
is a pseudometric; full metricity requires a characteristic $k$. See Sriperumbudur, Gretton, et al.
(2010) and Gretton, Borgwardt, et al. (2012) for details.

Szabo  et  al.  (2015)  proved  learning-theoretic  bounds  on  the  use  of  ridge  regression  with  mmd. 

2.2  Estimators  of  distributional  distances 

We now discuss methods for estimating different distributional distances $\rho$.

The  most  obvious  estimator  of  most  distributional  distances  is  perhaps  the  plug-in  approach: 
first  perform  density  estimation,  and  then  compute  distances  between  the  density  estimates.  These 
approaches  suffer  from  the  problem  that  the  density  is  in  some  sense  a  nuisance  parameter  for 
the  problem  of  distance  estimation,  and  density  estimation  is  quite  difficult,  particularly  in  higher 
dimensions. 

Some  of  the  methods  below  are  plug-in  methods;  others  correct  a  plug-in  estimate,  or  use 
inconsistent  density  estimates  in  such  a  way  that  the  overall  divergence  estimate  is  consistent. 

Parametric  models  Closed  forms  of  some  distances  are  available  for  certain  distributions: 

•  For members of the same exponential family, closed forms of the Bhattacharyya kernel
(corresponding to Hellinger distance) and certain other kernels of the form $D_{\alpha-1,\alpha}$ were
computed by Jebara et al. (2004). Nielsen and Nock (201 ) give closed forms for all
$D_{\alpha-1,1-\alpha}$, allowing the computation of $\mathrm{r}_\alpha$, $\mathrm{t}_\alpha$, and related divergences, as well as the kl
divergence via $\lim_{\alpha \to 1} D_{\alpha-1,1-\alpha}$.

•  For Gaussian distributions, Muandet, Scholkopf, et al. (2012) compute the closed form of
mmd for a few base kernels. Sutherland (2015) also conjectures a form for the Euclidean
emd and gives bounds.

•  For mixture distributions, $L_2$ and mmd can be computed based on the inner products between
the components by simple linearity arguments. For mixtures specifically of Gaussians, F.
Wang et al. (2009) obtain the quadratic ($\mathrm{r}_2$) entropy, which allows the computation of
Jensen-Renyi divergences for $\alpha = 2$.

For  cases  when  a  closed  form  does  not  exist,  numerical  integration  may  be  necessary,  often 
obviating  the  computational  advantages  of  this  approach. 

It is thus possible to fit a parametric model to each distribution and compute distances between
the fits; this is done for machine learning applications e.g. by Jebara et al. (2004) and Moreno
et al. (2004). In practice, however, we rarely know that a given parametric family is appropriate,
and so the use of parametric models introduces unavoidable approximation error and bias.

Figure 2.1: An illustration of some of the distributional distances considered here. (a) The example densities being considered. (b) The functions being integrated for some of the distances. For example, the tv image shows $\frac{1}{2} |p(x) - q(x)|$.

Histograms  One common method for representing distributions is the use of histograms; many
distances $\rho$ are then simple to compute, typically in $O(m)$ time for $m$-bin histograms. The
prominent exception to that is emd, which requires $O(m^3 \log m)$ time for exact computation (e.g.
Rubner et al. 2000), though in some settings $O(m)$ approximations are available (Shirdhonkar
and Jacobs 2008) and, as previously mentioned, the related Sinkhorn distance can be computed
quite quickly (Cuturi 2013). mmd also requires approximately $O(m^2)$ computation for typical
histograms.
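
As a minimal sketch of the $O(m)$ computations mentioned above, the following NumPy code evaluates a few distances between two $m$-bin histograms in a single pass over the bins. The function name `histogram_distances` and the dictionary keys are illustrative only, and the normalizations chosen for the Hellinger and $\chi^2$ terms are one convention among several, not necessarily the one used elsewhere in this thesis.

```python
import numpy as np


def histogram_distances(p, q):
    """Return several distances between two m-bin histograms, each in O(m).

    Both inputs are normalized to sum to one before comparison.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    s = p + q
    nz = s > 0  # bins empty in both histograms contribute nothing
    return {
        # Euclidean distance between the bin-probability vectors
        "l2": float(np.sqrt(np.sum((p - q) ** 2))),
        # total variation: half the L1 distance
        "tv": float(0.5 * np.sum(np.abs(p - q))),
        # one convention for squared Hellinger: 1 - Bhattacharyya affinity
        "hellinger_sq": float(1.0 - np.sum(np.sqrt(p * q))),
        # symmetric chi-squared sum, without any leading constant factor
        "chi2_sym": float(np.sum((p[nz] - q[nz]) ** 2 / s[nz])),
    }


if __name__ == "__main__":
    print(histogram_distances([1, 1, 2], [2, 1, 1]))
```

Each quantity is a sum of independent per-bin terms, which is also why (as discussed next) neighboring bins cannot influence one another.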

The main disadvantages of histograms are their poor performance in even moderate dimensions,
and the fact that (for most $\rho$s) choosing the right bin size is both quite important and quite
difficult, since nearby bins do not affect one another. Histogram density estimators also give
non-optimal rates for density estimation (Wasserman 2006), and present technical difficulties in
establishing consistent estimation as bin sizes decrease (Gretton and Gyorfi 2010).

Vector  quantization  An  improvement  over  standard  histograms,  popular  in  computer  vision, 
is  to  instead  quantize  distributions  to  group  points  by  their  nearest  codeword  from  a  dictionary, 
often  learned  via  k-means  or  a  similar  algorithm.  This  method  is  known  as  the  bag  of  words  (bow) 
approach  and  was  popularized  by  Leung  and  Malik  (2001).  This  method  empirically  scales  to 
much  higher  dimensions  than  the  histogram  approach,  but  suffers  from  similar  problems  related 
to  the  hard  assignment  of  sample  points  to  bins. 

Grauman and Darrell (2007) use multiple resolutions of histograms to compute distances,
helping somewhat with the issue of choosing bin sizes.

Kernel density estimation  Perhaps the most popular form of general-purpose nonparametric
density estimation is kernel density estimation (kde). kde results in a mixture distribution, which
allows $O(n^2)$ exact computation of plug-in mmd and $L_2$ for certain density kernels. Selection of
the proper bandwidth, however, is a significant issue.

Singh and Poczos (201 ) show exponential concentration for a particular plug-in estimator
for a broad class of functionals including $L_p$, $D_{\alpha,\beta}$, and $f$-divergences as well as js, though
they do not discuss computational issues of the estimator, which in general requires numerical
integration.

Krishnamurthy et al. (201 ) correct a plug-in estimator for $L_2$ and $\mathrm{r}_\alpha$ divergences by estimating
higher order terms in the von Mises expansion; one of their estimators is computationally attractive
and optimal for smooth distributions, while another is optimal for a broader range of distributions
but requires numerical integration.

k-NN density estimator  The $k$-nn density estimator provides the basis for another family of
estimators. These estimators require $k$-nearest neighbor distances within and between the sample
sets. Much research has been put into data structures for efficient approximate nearest neighbor
computation (e.g. Beygelzimer et al. 2006; Muja and Lowe 2009; Andoni and Razenshteyn 2015;
Naidan et al. 2015), though in high dimensions the problem is quite difficult and brute-force
pairwise computation may be the most efficient method. Plug-in methods require $k$ to grow
with sample size for consistency, typically at a rate around $\sqrt{n}$, which makes computation more
difficult.

Q. Wang et al. (2009) give a simple, consistent $k$-nn kl divergence estimator. Poczos and
Schneider (201 ) give a similar estimator for $D_{\alpha-1,1-\alpha}$ and show consistency; Poczos, Xiong,
Sutherland, et al. (2012) generalize to $D_{\alpha,\beta}$. This family of estimators is consistent with a fixed
$k$, though convergence rates are not known.
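
To make the fixed-$k$ approach concrete, here is a brute-force sketch of a $k$-nn kl divergence estimator in the style of Q. Wang et al. (2009). The specific formula, $\frac{d}{n} \sum_i \log \frac{\nu_k(i)}{\rho_k(i)} + \log \frac{m}{n-1}$, where $\rho_k(i)$ and $\nu_k(i)$ are the distances from $X_i$ to its $k$th nearest neighbor within its own sample and within the other sample, is our reading of that family of estimators and should be checked against the paper; the function name `knn_kl_divergence` is hypothetical. It uses $O(nm)$ pairwise distances rather than any nearest-neighbor data structure.

```python
import numpy as np


def knn_kl_divergence(x, y, k=1):
    """Estimate KL(P || Q) from samples x ~ P (n x d) and y ~ Q (m x d).

    Brute-force version: computes all pairwise Euclidean distances.
    Assumes no duplicated points, so that all neighbor distances are positive.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, d = x.shape
    m = y.shape[0]
    d_xx = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(-1))
    d_xy = np.sqrt(((x[:, None, :] - y[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d_xx, np.inf)  # a point is not its own neighbor
    rho = np.sort(d_xx, axis=1)[:, k - 1]  # k-th NN distance within x
    nu = np.sort(d_xy, axis=1)[:, k - 1]   # k-th NN distance from x into y
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1.0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(400, 1))
    print(knn_kl_divergence(x, rng.normal(size=(400, 1)), k=3))
    print(knn_kl_divergence(x, rng.normal(loc=3.0, size=(400, 1)), k=3))
```

Note that $k$ stays fixed as the sample size grows, in contrast to the plug-in methods above.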

Moon and Hero (2014a,b) propose an $f$-divergence estimator based on ensembles of plug-in
estimators, and show the distribution is asymptotically Gaussian. (Their estimator requires
neither convex $f$ nor $f(1) = 0$.)


Mean map estimators  A natural estimator of $\langle \mu_{\mathcal{H}}(P), \mu_{\mathcal{H}}(Q) \rangle_{\mathcal{H}}$ is simply the mean of the
pairwise kernel evaluations between the two sets, $\frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} k(X_i, Y_j)$; this is the inner product
between embeddings of the empirical distributions of the two samples. The estimator
$\frac{1}{n} \sum_{i=1}^{n} k(X_i, Y_i)$ allows use in the streaming setting. We can then estimate mmd via $\|x - y\|^2 =
\langle x, x \rangle + \langle y, y \rangle - 2 \langle x, y \rangle$ (Gretton, Borgwardt, et al. 2012). Section 6.1 gives much more detail
on variations of these estimators of mmd.
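
The pairwise-mean estimator can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it uses a Gaussian base kernel with bandwidth `sigma`, includes the diagonal terms in the within-sample means (so it estimates the squared mmd between the two empirical distributions, which is a biased estimate of the population quantity), and the function names are our own.

```python
import numpy as np


def gaussian_kernel(a, b, sigma=1.0):
    """Matrix of k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))


def mmd2_biased(x, y, sigma=1.0):
    """Squared mmd between the empirical distributions of x and y.

    Uses ||mu_x - mu_y||^2 = <mu_x, mu_x> + <mu_y, mu_y> - 2 <mu_x, mu_y>,
    with each inner product estimated by a mean of pairwise kernel values.
    """
    k_xx = gaussian_kernel(x, x, sigma).mean()
    k_yy = gaussian_kernel(y, y, sigma).mean()
    k_xy = gaussian_kernel(x, y, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(200, 2))
    print(mmd2_biased(x, rng.normal(size=(200, 2))))  # same distribution
    print(mmd2_biased(x, x + 3.0))                    # shifted copy
```

The cost is quadratic in the sample sizes, since every pair of points is compared.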

Muandet,  Fukumizu,  et  al.  (2014)  proposed  biasing  the  estimator  of  mmd  to  obtain  smaller 
variance  via  the  idea  of  Stein  shrinkage  (1956).  Ramdas  and  Wehbe  (2015)  showed  the  efficacy 
of  this  approach  for  independence  testing. 


Other approaches  Nguyen et al. (2010) provide an estimator for $f$-divergences (requiring
convex $f$ but not $f(1) = 0$) by solving a convex program. When an rkhs structure is imposed, it
requires solving a general convex program with dimensionality equal to the number of samples,
so the estimator is quite computationally expensive.

Sriperumbudur et al. (2012) estimate the $L_1$-emd via a linear program.

K. Yang et al. (201 ) estimate $f$- and $\mathrm{r}_\alpha$ divergences by adaptively partitioning both distributions
simultaneously. Their Bayesian approach requires mcmc and is computationally expensive,
though it does provide a posterior over the divergence value which can be useful in some settings.


2.3  Kernels  on  distributions 

We consider two methods for defining kernels based on distributional distances $\rho$. Proposition 1 of
Haasdonk and Bahlmann (2004) shows that both methods always create positive definite kernels
iff $\rho$ is isometric to an $L_2$ norm, i.e. there exist a Hilbert space $\mathcal{H}$ and a mapping $\Phi : \mathcal{X} \to \mathcal{H}$
such that $\rho(P, Q) = \| \Phi(P) - \Phi(Q) \|$. Such metrics are also called Hilbertian.¹

For  distances  that  do  not  satisfy  this  property,  we  will  instead  construct  an  indefinite  kernel 
as  below  and  then  “correct”  it,  as  discussed  in  Section  2.4.1. 

¹ Note that if $\rho$ is Hilbertian, Proposition 1 (ii) of Haasdonk and Bahlmann (2004) shows that $-\rho^{2\beta}$ is conditionally
positive definite for any $0 \le \beta \le 1$; by a classic result of Schoenberg (1938), this implies that $\rho^\beta$ is also Hilbertian.
We will use this fact later.

The first method is to create a “linear kernel” $k$ such that $\rho^2(P, Q) = k(P, P) + k(Q, Q) -
2 k(P, Q)$, so that the rkhs with inner product $k$ has metric $\rho$. Note that, while distances are
translation-invariant, inner products are not; we must thus first choose some origin $B$. Then

\[
k_{\mathrm{lin}}^{(B)}(P, Q) := \frac{1}{2} \left( \rho^2(P, B) + \rho^2(Q, B) - \rho^2(P, Q) \right) \tag{2.2}
\]

is a valid kernel for any $B$ iff $\rho$ is Hilbertian. If $\rho$ is defined for the zero measure, it is often most
natural to use that as the origin; in the cases it is used below, it is easy to verify that $k_{\mathrm{lin}}^{(0)}$ is a valid
kernel inducing the relevant distance despite issues of whether $\rho(P, 0)$ is defined.

We can also use $\rho$ in a generalized rbf kernel: for a bandwidth parameter $\sigma > 0$,

\[
k_{\mathrm{rbf}}^{(\sigma)}(P, Q) := \exp \left( - \frac{1}{2 \sigma^2} \rho^2(P, Q) \right). \tag{2.3}
\]
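
Both constructions act only on a matrix of squared distances, so they can be sketched without reference to any particular $\rho$. The code below assumes the linear kernel takes the form $\frac{1}{2}(\rho^2(P, B) + \rho^2(Q, B) - \rho^2(P, Q))$ and that the generalized rbf kernel takes the usual form $\exp(-\rho^2 / (2\sigma^2))$; the function names and argument layout are hypothetical.

```python
import numpy as np


def linear_kernel_from_distances(rho_sq, rho_sq_to_origin):
    """Linear kernel built from squared distances and a chosen origin B.

    rho_sq[i, j] holds rho^2(P_i, P_j); rho_sq_to_origin[i] holds rho^2(P_i, B).
    """
    b = np.asarray(rho_sq_to_origin, dtype=float)
    return 0.5 * (b[:, None] + b[None, :] - np.asarray(rho_sq, dtype=float))


def rbf_kernel_from_distances(rho_sq, sigma):
    """Generalized rbf kernel exp(-rho^2 / (2 sigma^2))."""
    return np.exp(-np.asarray(rho_sq, dtype=float) / (2.0 * sigma ** 2))


if __name__ == "__main__":
    # Sanity check with a Hilbertian distance: Euclidean distance, origin 0.
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(5, 3))
    rho_sq = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    k_lin = linear_kernel_from_distances(rho_sq, (pts ** 2).sum(-1))
    print(np.allclose(k_lin, pts @ pts.T))
```

For the Euclidean distance with the zero vector as origin, the linear kernel reduces to the ordinary inner product, which is the sense in which it is "linear".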


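The $L_2$ distance is clearly Hilbertian; $k_{\mathrm{lin}}^{(0)}(P, Q) = \int p(x) \, q(x) \, dx$.

Fuglede (2005) shows that $\sqrt{\mathrm{tv}}$, h, and $\sqrt{\mathrm{js}}$ are Hilbertian.²

•  For $\sqrt{\mathrm{tv}}$, $k_{\mathrm{lin}}^{(0)}(P, Q) = \frac{1}{2} \left( 1 - \mathrm{tv}(P, Q) \right)$ since $\mathrm{tv}(P, 0) = \frac{1}{2} \|P\|_1 = \frac{1}{2}$.

•  For h, $k_{\mathrm{lin}}^{(0)}(P, Q) = 1 - \frac{1}{2} \mathrm{h}^2(P, Q) = \frac{1}{2} + \frac{1}{2} \int \sqrt{p(x) \, q(x)} \, dx$, but the halved Bhattacharyya
affinity $k(P, Q) = \frac{1}{2} \int \sqrt{p(x) \, q(x)} \, dx$ is more natural.

•  For $\sqrt{\mathrm{js}}$, $k_{\mathrm{lin}}^{(0)}(P, Q) = \frac{1}{2} \left( H\left[ \frac{P + 0}{2} \right] + H\left[ \frac{Q + 0}{2} \right] - H\left[ \frac{P + Q}{2} \right] - H[0] \right)$.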


Topspe  (  0)  shows  that  xs  is  Hilbertian;  k'^(P,  Q)  =  1  -  |^(P,  Q ).  The  computer  vision 

community sometimes uses as a kernel simply $-\chi^2(P, Q)$, which is only conditionally positive
definite (Zhang et al. 2006). $\chi_A$ is also Hilbertian, as shown by Vedaldi and Zisserman (2012)
using the result of Fuglede (2005).

Gardner et al. (2015) show that emd is Hilbertian for the unusual choice of ground metric
$\rho(x, y) = \mathbb{1}(x \ne y)$. emd is probably not Hilbertian in most cases for Euclidean base distance:
Naor and Schechtman (2007) prove that Euclidean emd on distributions supported on a grid in
$\mathbb{R}^2$ does not embed in $L_1$, which since $L_2$ embeds into $L_1$ (Bretagnolle et al. 1966) means that
emd on that grid does not embed in $L_2$. It is thus extremely likely that this also implies $L_2$-emd
on continuous distributions over $\mathbb{R}^d$ for $d \ge 2$ is not Hilbertian. The most common kernel based
on emd, however, is actually $\exp(-\gamma \, \mathrm{emd}(P, Q))$. Whether that kernel is positive definite seems
to remain an open question, defined by whether $\sqrt{\mathrm{emd}}$ is Hilbertian; studies that have used it in
practice have not reported finding any instance of an indefinite kernel matrix (Zhang et al. 2006).

The mmd is Hilbertian by definition. The natural associated linear kernel is $k(P, Q) =
\langle \mu_{\mathcal{H}}(P), \mu_{\mathcal{H}}(Q) \rangle_{\mathcal{H}}$, which we term the mean map kernel (mmk).


2.4  Kernels  on  sample  sets 

As  discussed  previously,  in  practice  we  rarely  directly  observe  a  probability  distribution;  rather, 
we  observe  samples  from  those  distributions.  We  will  instead  construct  a  kernel  on  sample  sets, 

based on an estimate of a kernel on distributions using an estimate of the base distance $\rho$.

² See his Theorem 2. For $\sqrt{\mathrm{tv}}$, use $K_{\infty,1}$; for h, use $K_{\frac{1}{2},1}$. For $\sqrt{\mathrm{js}}$, differentiate $K_{p,1}$ around $p = 1$, following the
note after the theorem.

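We wish to estimate a kernel on $N$ distributions $\{P^{(i)}\}_{i=1}^{N}$ based on an iid sample from each
distribution $\{X^{(i)}\}_{i=1}^{N}$, where $X^{(i)} = \{X^{(i)}_j\}_{j=1}^{n_i}$, $X^{(i)}_j \in \mathbb{R}^d$. Given an estimator $\hat\rho(X^{(i)}, X^{(j)})$ of
$\rho(P^{(i)}, P^{(j)})$, we estimate $k(P^{(i)}, P^{(j)})$ with $\hat k(X^{(i)}, X^{(j)})$ by substituting $\hat\rho(X^{(i)}, X^{(j)})$ for $\rho(P^{(i)}, P^{(j)})$ in
(2.2) or (2.3). We thus obtain an estimate $\hat K$ of the true kernel matrix $K$, where $\hat K_{ij} = \hat k(X^{(i)}, X^{(j)})$.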


2.4.1  Handling  indefinite  kernel  matrices 


Section 2.3 established that $K$ is positive semidefinite for many distributional distances $\rho$, but for
some, particularly skl and $\mathrm{sr}_\alpha$, $K$ is indefinite. Even if $K$ is psd, however, depending on the form
of the estimator, $\hat K$ is likely to be indefinite.

In  this  case,  for  many  downstream  learning  tasks  we  must  modify  K  to  be  positive  semidefinite. 
Chen  et  al.  (2009)  study  this  setting,  presenting  four  methods  to  make  K  psd: 

•  Spectrum  clip:  Set  any  negative  eigenvalues  in  the  spectrum  of  K  to  zero.  This  yields  the 
nearest  psd  matrix  to  K  in  Frobenius  norm,  and  corresponds  to  the  view  where  negative 
eigenvalues  are  simply  noise. 

•  Spectrum  flip:  Replace  any  negative  eigenvalues  in  the  spectrum  with  their  absolute  value. 


•  Spectrum shift: Increase each eigenvalue in the spectrum by the magnitude of the smallest
eigenvalue, by taking $\hat K + |\lambda_{\min}| I$. When $|\lambda_{\min}|$ is small, this is computationally simpler (it
is easier to find $\lambda_{\min}$ than to find all negative eigenvalues, and it requires modifying only the
diagonal elements) but can change $\hat K$ more drastically.

•  Spectrum square: Square the eigenvalues, by using $\hat K \hat K^T$. This is equivalent to using the
kernel estimates as features.

We denote this operation by $\Pi$.
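
The four spectrum modifications can be sketched with a single eigendecomposition. The code below is a minimal illustration, not the implementation of Chen et al. (2009); the function name `make_psd` and its `method` argument are our own, and the shift variant here adds $|\lambda_{\min}|$ only when the smallest eigenvalue is negative.

```python
import numpy as np


def make_psd(k_hat, method="clip"):
    """Return a psd modification of a symmetric, possibly indefinite matrix."""
    k_hat = 0.5 * (k_hat + k_hat.T)  # symmetrize against round-off
    lam, u = np.linalg.eigh(k_hat)
    if method == "clip":      # negative eigenvalues -> 0
        lam = np.maximum(lam, 0.0)
    elif method == "flip":    # negative eigenvalues -> absolute value
        lam = np.abs(lam)
    elif method == "shift":   # add |lambda_min| to every eigenvalue
        lam = lam + max(0.0, -lam.min())
    elif method == "square":  # eigenvalues of K K^T
        lam = lam ** 2
    else:
        raise ValueError(f"unknown method: {method!r}")
    return (u * lam) @ u.T    # U diag(lam) U^T


if __name__ == "__main__":
    k = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues 3 and -1
    for name in ("clip", "flip", "shift", "square"):
        print(name, make_psd(k, name).tolist())
```

For the shift and square variants the eigendecomposition is not actually needed (one can add to the diagonal or multiply the matrix by itself); it is used here only to keep the four cases uniform.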

When test values are available at training time, i.e. in a transductive setting, it is best to
perform these operations on the full kernel matrix containing both training and test points: that
is, to use

\[
\Pi \left( \begin{bmatrix} \hat K_{\mathrm{train}} & \hat K_{\mathrm{train,test}} \\ \hat K_{\mathrm{test,train}} & \hat K_{\mathrm{test}} \end{bmatrix} \right).
\]

(Note that $\hat K_{\mathrm{test}}$ is not actually used by e.g. an svm.) If the
changes are performed only on the training matrix, i.e. using

\[
\begin{bmatrix} \Pi(\hat K_{\mathrm{train}}) & \hat K_{\mathrm{train,test}} \\ \hat K_{\mathrm{test,train}} & \hat K_{\mathrm{test}} \end{bmatrix},
\]

which is
necessary in the typical inductive setting, the resulting full kernel matrix may not be psd, and the
kernel estimates may be treated inconsistently between training and test points. This is more of
an issue for a truly-indefinite kernel, e.g. one based on kl or $\mathrm{r}_\alpha$, where the changes due to $\Pi$ may
be larger.

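When the test values are not available, Chen et al. (2009) propose a heuristic to account for
the effect of $\Pi$: for spectrum clip and flip, they find the linear transformation which maps $\hat K_{\mathrm{train}}$
to $\Pi(\hat K_{\mathrm{train}})$, based on the eigendecomposition of $\hat K_{\mathrm{train}}$, and apply it to $\hat K_{\mathrm{train,test}}$. That is, they
find the $P$ such that $\Pi(\hat K_{\mathrm{train}}) = P \hat K_{\mathrm{train}}$ as follows: let the eigendecomposition of $\hat K_{\mathrm{train}}$ be $U \Lambda U^T$,
with eigenvalues denoted $\lambda_1, \dots, \lambda_N$. Then $P$ is $U M U^T$, with $M$ defined as:

\[
M_{\mathrm{flip}} := \mathrm{diag}(\mathrm{sign}(\lambda_1), \dots, \mathrm{sign}(\lambda_N)) \tag{2.4}
\]
\[
M_{\mathrm{clip}} := \mathrm{diag}(\mathbb{1}(\lambda_1 \ge 0), \dots, \mathbb{1}(\lambda_N \ge 0)).
\]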




For spectrum shift, no such linear transform is available, but it is easy to account for the effect of
$\Pi$: simply add $|\lambda_{\min}| I$ to $\hat K_{\mathrm{test}}$ as well.
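
The heuristic for clip and flip can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of Chen et al. (2009): the function name `psd_transform` is hypothetical, $M$ is taken as $\mathrm{diag}(\mathrm{sign}(\lambda_i))$ for flip and $\mathrm{diag}(\mathbb{1}(\lambda_i \ge 0))$ for clip, and zero eigenvalues are given sign $+1$ in the flip case (which does not affect $P \hat K_{\mathrm{train}}$).

```python
import numpy as np


def psd_transform(k_train, method="flip"):
    """Return (P, P @ k_train), where P = U M U^T maps k_train to Pi(k_train)."""
    k_train = 0.5 * (k_train + k_train.T)
    lam, u = np.linalg.eigh(k_train)
    if method == "flip":
        m = np.where(lam < 0, -1.0, 1.0)
    elif method == "clip":
        m = (lam >= 0).astype(float)
    else:
        raise ValueError(f"no linear transform for method: {method!r}")
    p = (u * m) @ u.T
    return p, p @ k_train


if __name__ == "__main__":
    k_train = np.array([[1.0, 2.0], [2.0, 1.0]])
    k_train_test = np.array([[0.5], [1.5]])  # made-up kernel values to one test point
    p, k_fixed = psd_transform(k_train, "flip")
    print(k_fixed.tolist())             # modified training matrix
    print((p @ k_train_test).tolist())  # same transform applied to test columns
```

The same matrix $P$ is then applied to each column of kernel values between the training points and a new test point, so that training and test evaluations are treated consistently.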

In  general,  we  find  that  the  transductive  method  is  better  than  the  heuristic  approach,  which 
is  better  than  ignoring  the  problem,  but  the  size  of  these  gaps  is  problem-specific:  for  some 
problems,  the  gap  is  substantial,  but  for  others  it  matters  little. 

When performing bandwidth selection for a generalized Gaussian rbf kernel, this approach
requires separately eigendecomposing each $\hat K_{\mathrm{train}}$. Xiong (2013, Chapter 6) considers a different
solution: rank-penalized metric multidimensional scaling according to $\rho$, so that standard
Gaussian rbf kernels may be applied to the embedded points. That work does not consider
the inductive setting, though an approach similar to that of Bengio et al. (2004) is likely to be
applicable.


2.4.2  Nystrom  approximation 


When $N$ is large, computing and operating on the full $N \times N$ kernel matrix can be quite expensive:
many kernel entries must be computed and stored (or else re-computed, at significant cost per
entry), and many learning techniques as well as the techniques to account for indefiniteness in
the kernel estimate require $O(N^3)$ work.

One method for approaching this problem is the Nystrom extension (Williams and Seeger
2000). In this method, we somehow pick $m < N$ anchor points, perhaps by uniform random
sampling or by approximate leverage scores (El Alaoui and Mahoney 2015). Reordering the
kernel matrix so that these $m$ points come first, let the kernel matrix be

\[
K = \begin{bmatrix} A & B \\ B^T & C \end{bmatrix},
\]

where $A$ is the
$m \times m$ kernel matrix of the anchor points, $B$ is the $m \times (N - m)$ matrix of kernel values from the
anchor points to all other points, and $C$ is the $(N - m) \times (N - m)$ matrix of kernel values among
the other points. We fully evaluate $A$ and $B$, but leave $C$ unevaluated; our goal is to approximate
it assuming that the matrix is low-rank.


Standard Nystrom  Let $A = U \Lambda U^T$ be the eigendecomposition of $A$. The Nystrom method approximates $C$ by
assuming that $K$ is of rank $m$, and using $\Lambda$
as the eigenvalues for $K$, while approximating the $N - m$ unknown eigenvectors by $B^T U \Lambda^\dagger$, where
$\Lambda^\dagger$ denotes the Moore-Penrose pseudoinverse of $\Lambda$. (Here, $\Lambda$ is diagonal, so the pseudoinverse
coincides with the standard inverse except if any eigenvalues are zero.) Thus our approximation
of $K$ is

\[
\tilde U := \begin{bmatrix} U \\ B^T U \Lambda^\dagger \end{bmatrix}
\]
\[
\tilde K := \tilde U \Lambda \tilde U^T
= \begin{bmatrix} U \Lambda U^T & U \Lambda \Lambda^\dagger U^T B \\ B^T U \Lambda^\dagger \Lambda U^T & B^T U \Lambda^\dagger \Lambda \Lambda^\dagger U^T B \end{bmatrix}
= \begin{bmatrix} A & A A^\dagger B \\ B^T A^\dagger A & B^T A^\dagger B \end{bmatrix}.
\]

When $A$ is nonsingular, $A A^\dagger = I$ and so the $m \times (N - m)$ portions of the kernel matrix are
unchanged. Otherwise,

\[
A A^\dagger = U \Lambda U^T U \Lambda^\dagger U^T = U \, \mathrm{diag}(\mathbb{1}(\lambda_1 \ne 0), \dots, \mathbb{1}(\lambda_m \ne 0)) \, U^T
\]

and so $B$ is projected onto the image of $A$.

Explicit $m$-dimensional embeddings for the training points are then available as $\tilde U \Lambda^{\frac{1}{2}}$, which
can then be used in training models; new points can be embedded in the same $m$-dimensional
space comparably. For a recent theoretical analysis of the effect of this approximation on kernel
ridge regression, see Rudi et al. (2015).
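
A minimal NumPy sketch of the standard Nystrom embedding, assuming a psd anchor block $A$ and taking the features to be $\tilde U \Lambda^{1/2}$ with $\tilde U = [U; B^T U \Lambda^\dagger]$. The function name `nystrom_features` and the relative tolerance `rtol` used to decide which eigenvalues count as zero in the pseudoinverse are our own choices.

```python
import numpy as np


def nystrom_features(a, b, rtol=1e-10):
    """Explicit m-dimensional features whose Gram matrix is the Nystrom approximation.

    a: (m, m) kernel matrix among anchor points, assumed psd.
    b: (m, N - m) kernel values from anchors to the remaining points.
    Rows of the result are ordered as [anchors, remaining points].
    """
    a = 0.5 * (a + a.T)
    lam, u = np.linalg.eigh(a)
    keep = lam > rtol * np.abs(lam).max()  # treat tiny eigenvalues as zero
    lam_pinv = np.zeros_like(lam)
    lam_pinv[keep] = 1.0 / lam[keep]
    lam_kept = np.where(keep, lam, 0.0)
    u_tilde = np.vstack([u, b.T @ (u * lam_pinv)])  # [U; B^T U pinv(Lambda)]
    return u_tilde * np.sqrt(lam_kept)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(8, 3))
    k = z @ z.T  # an exactly rank-3 kernel matrix
    feats = nystrom_features(k[:3, :3], k[:3, 3:])
    print(np.allclose(feats @ feats.T, k, atol=1e-6))
```

When the anchor block has the same rank as the full matrix, as in this example, the reconstruction is exact; otherwise the unevaluated block $C$ is replaced by $B^T A^\dagger B$.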


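Indefinite Nystrom via svd  When $A$ is indefinite, we need to combine the Nystrom approach
with some type of projection to the psd cone, as in Section 2.4.1. Belongie et al. (200 ) give an
analogous method for doing so, which we present here:

Let the singular value decomposition of $A$ be $U_{\mathrm{svd}} \Lambda_{\mathrm{svd}} V_{\mathrm{svd}}^T$. Let
$\tilde U_{\mathrm{svd}} := \begin{bmatrix} U_{\mathrm{svd}} \\ B^T U_{\mathrm{svd}} \Lambda_{\mathrm{svd}}^\dagger \end{bmatrix}$, and
the Nystrom reconstruction be

\[
\tilde K_{\mathrm{svd}} := \tilde U_{\mathrm{svd}} \Lambda_{\mathrm{svd}} \tilde U_{\mathrm{svd}}^T
= \begin{bmatrix}
U_{\mathrm{svd}} \Lambda_{\mathrm{svd}} U_{\mathrm{svd}}^T & U_{\mathrm{svd}} \Lambda_{\mathrm{svd}} \Lambda_{\mathrm{svd}}^\dagger U_{\mathrm{svd}}^T B \\
B^T U_{\mathrm{svd}} \Lambda_{\mathrm{svd}}^\dagger \Lambda_{\mathrm{svd}} U_{\mathrm{svd}}^T & B^T U_{\mathrm{svd}} \Lambda_{\mathrm{svd}}^\dagger \Lambda_{\mathrm{svd}} \Lambda_{\mathrm{svd}}^\dagger U_{\mathrm{svd}}^T B
\end{bmatrix}.
\]

To understand this approximation, define $A_{\mathrm{flip}} := U \, \mathrm{abs}(\Lambda) \, U^T$, where abs denotes taking the
elementwise absolute value: this is the “spectrum flip” method of Chen et al. (2009). Then
we have that $U_{\mathrm{svd}} \Lambda_{\mathrm{svd}} U_{\mathrm{svd}}^T = A_{\mathrm{flip}}$. First, $A A^T = U_{\mathrm{svd}} \Lambda_{\mathrm{svd}}^2 U_{\mathrm{svd}}^T$, so (as singular values are
nonnegative) its matrix square root is just $U_{\mathrm{svd}} \Lambda_{\mathrm{svd}} U_{\mathrm{svd}}^T$. We also have that $A A^T = U \Lambda^2 U^T$, so
its matrix square root can also be written $U \, \mathrm{abs}(\Lambda) \, U^T = A_{\mathrm{flip}}$. Because the principal square root
of the psd matrix $A A^T$ is unique, $U_{\mathrm{svd}} \Lambda_{\mathrm{svd}} U_{\mathrm{svd}}^T = A_{\mathrm{flip}}$. Thus

\[
\tilde K_{\mathrm{svd}} = \begin{bmatrix}
A_{\mathrm{flip}} & A_{\mathrm{flip}} A_{\mathrm{flip}}^\dagger B \\
B^T A_{\mathrm{flip}}^\dagger A_{\mathrm{flip}} & B^T A_{\mathrm{flip}}^\dagger B
\end{bmatrix}. \tag{2.5}
\]

Explicit $m$-dimensional features are again available as $\tilde U_{\mathrm{svd}} \Lambda_{\mathrm{svd}}^{\frac{1}{2}}$.

As long as $A$ is nonsingular, $A_{\mathrm{flip}}$ is positive definite, and the $m \times (N - m)$ evaluations are
unaffected. Again, if $A$ is singular then $B$ is projected onto the image of $A_{\mathrm{flip}}$.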
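
The svd-based variant can be sketched in the same way as the standard one, replacing the eigendecomposition of $A$ with its singular value decomposition. This is an illustrative sketch, not the code of Belongie et al.; the function name `indefinite_nystrom_features` and the tolerance `rtol` are assumptions. The small example checks numerically that the anchor block of the reconstruction is the spectrum-flipped $A$ and that, for nonsingular $A$, the block $B$ is left unchanged.

```python
import numpy as np


def indefinite_nystrom_features(a, b, rtol=1e-10):
    """Features [U; B^T U pinv(S)] S^{1/2} built from the svd A = U S V^T.

    a: (m, m) symmetric, possibly indefinite anchor kernel matrix.
    b: (m, N - m) kernel values from anchors to the remaining points.
    """
    u, s, _ = np.linalg.svd(a)
    keep = s > rtol * s.max()
    s_pinv = np.zeros_like(s)
    s_pinv[keep] = 1.0 / s[keep]
    u_tilde = np.vstack([u, b.T @ (u * s_pinv)])
    return u_tilde * np.sqrt(s)


if __name__ == "__main__":
    a = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues 3 and -1
    b = np.array([[0.5, -1.0, 2.0], [1.5, 0.0, 1.0]])
    feats = indefinite_nystrom_features(a, b)
    k_tilde = feats @ feats.T
    print(k_tilde[:2, :2].tolist())         # spectrum-flipped A
    print(np.allclose(k_tilde[:2, 2:], b))  # B is unchanged
```

Because the result is a Gram matrix of explicit features, it is psd by construction.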


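Consistent indefinite Nystrom  The last approach corresponds to, given $A$ and $B$, taking the
Nystrom approximation with $A_{\mathrm{flip}}$ and an unmodified $B$. But not modifying $B$ to account for the
psd transformation means that, if a point from the $m$ inducing points were repeated in the $N - m$
other points, it would be treated inconsistently. We could assuage this problem with the heuristic
linear transform of Chen et al. (2009) by performing the Nystrom approximation based on $A_{\mathrm{flip}}$
and $P_{\mathrm{flip}} B = U M_{\mathrm{flip}} U^T B$, where $M_{\mathrm{flip}}$ was defined in (2.4), rather than $A_{\mathrm{flip}}$ and an unmodified $B$.
Writing $\Lambda_{\mathrm{flip}} := \mathrm{abs}(\Lambda)$, this gives a reconstruction of

\[
\tilde U_{\mathrm{flip}} := \begin{bmatrix} U \\ (P_{\mathrm{flip}} B)^T U \Lambda_{\mathrm{flip}}^\dagger \end{bmatrix}
= \begin{bmatrix} U \\ B^T U M_{\mathrm{flip}} U^T U \Lambda_{\mathrm{flip}}^\dagger \end{bmatrix}
= \begin{bmatrix} U \\ B^T U \Lambda^\dagger \end{bmatrix}
\]
\[
\tilde K_{\mathrm{flip}} := \tilde U_{\mathrm{flip}} \Lambda_{\mathrm{flip}} \tilde U_{\mathrm{flip}}^T
= \begin{bmatrix}
U \Lambda_{\mathrm{flip}} U^T & U \Lambda_{\mathrm{flip}} \Lambda^\dagger U^T B \\
B^T U \Lambda^\dagger \Lambda_{\mathrm{flip}} U^T & B^T U \Lambda^\dagger \Lambda_{\mathrm{flip}} \Lambda^\dagger U^T B
\end{bmatrix}
= \begin{bmatrix}
A_{\mathrm{flip}} & P'_{\mathrm{flip}} B \\
B^T P'_{\mathrm{flip}} & B^T A_{\mathrm{flip}}^\dagger B
\end{bmatrix},
\]

where $P'_{\mathrm{flip}} = U \, \mathrm{diag}(\mathrm{sign}(\lambda_1), \dots, \mathrm{sign}(\lambda_m)) \, U^T$, which is the same as $P_{\mathrm{flip}}$ except with directions
corresponding to zero eigenvalues zeroed out. Compared to the $A_{\mathrm{flip}} A_{\mathrm{flip}}^\dagger$ used in the equivalent
place in (2.5), this flips directions corresponding to negative eigenvalues in $A$. The $m \times m$ known
kernel values and $(N - m) \times (N - m)$ unknown kernel values are the same as in (2.5).

We can also use $A_{\mathrm{clip}}$ and $P_{\mathrm{clip}} B$ to produce a similar $\tilde K_{\mathrm{clip}}$.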

A full experimental evaluation of these approaches to Nystrom extension of indefinite kernels
is an area for future work.

Chapter 3

Approximate kernel embeddings via random Fourier features


As discussed in Section 2.4.2, the kernel methods of Chapter 2 share a common drawback:
solving learning problems with $N$ distributions typically requires computing all or most of the
$N \times N$ kernel matrix. Further, many of the methods of Section 2.4.1 to deal with indefinite kernels
require eigendecompositions, often requiring $O(N^3)$ work. For large $N$, this quickly becomes
impractical.

Section 2.4.2 gave one approach for countering this problem. Rahimi and Recht (2007)
spurred recent interest in another method: approximate embeddings $z : \mathcal{X} \to \mathbb{R}^D$ such that
$k(x, y) \approx z(x)^T z(y)$. Learning primal models in $\mathbb{R}^D$ using the $z$ features can then usually be
accomplished in time linear in $n$, with the models on $z$ approximating the models on $k$.

This  chapter  reviews  the  method  of  Rahimi  and  Recht  (2007),  providing  some  additional 
theoretical  understanding  to  the  original  analyses.  Chapter  4  will  apply  these  techniques  to  the 
distributional  setting. 


3.1  Setup 


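Rahimi and Recht (2007) considered continuous shift-invariant kernels on $\mathbb{R}^d$, i.e. those that can
be written $k(x, y) = k(\Delta)$, where we will use $\Delta := x - y$ throughout. In this case, Bochner’s
theorem (1959) guarantees that the Fourier transform of $k$ will be a nonnegative finite measure
on $\mathbb{R}^d$, which can be easily normalized to a probability distribution $\Omega$. Thus if we define

\[
\tilde z(x) := \sqrt{\frac{2}{D}} \begin{bmatrix} \sin(\omega_1^T x) & \cos(\omega_1^T x) & \cdots & \sin(\omega_{D/2}^T x) & \cos(\omega_{D/2}^T x) \end{bmatrix}^T,
\qquad \{\omega_i\}_{i=1}^{D/2} \sim \Omega^{D/2} \tag{3.1}
\]

and let $\tilde s(x, y) := \tilde z(x)^T \tilde z(y)$, we have that

\[
\tilde s(x, y) = \frac{2}{D} \sum_{i=1}^{D/2} \left[ \sin(\omega_i^T x) \sin(\omega_i^T y) + \cos(\omega_i^T x) \cos(\omega_i^T y) \right]
= \frac{1}{D/2} \sum_{i=1}^{D/2} \cos(\omega_i^T \Delta).
\]

Noting that $\mathbb{E} \cos(\omega^T \Delta) = \int \Re \, e^{i \omega^T \Delta} \, d\Omega(\omega) = \Re \, k(\Delta)$, where $\Re$ denotes the real part, we therefore
have $\mathbb{E} \, \tilde s(x, y) = k(x, y)$.
$k$ is the characteristic function of $\Omega$, and $\tilde s$ the empirical characteristic function corresponding
to the samples $\{\omega_i\}$.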

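Rahimi and Recht (2007) also alternatively proposed

\[
\breve z(x) := \sqrt{\frac{2}{D}} \begin{bmatrix} \cos(\omega_1^T x + b_1) & \cdots & \cos(\omega_D^T x + b_D) \end{bmatrix}^T,
\qquad \{\omega_i\}_{i=1}^{D} \sim \Omega^{D}, \quad \{b_i\}_{i=1}^{D} \sim \mathrm{Unif}_{[0, 2\pi]}^{D}. \tag{3.2}
\]

Letting $\breve s(x, y) := \breve z(x)^T \breve z(y)$, we have

\[
\breve s(x, y) = \frac{2}{D} \sum_{i=1}^{D} \cos(\omega_i^T x + b_i) \cos(\omega_i^T y + b_i)
= \frac{1}{D} \sum_{i=1}^{D} \left[ \cos(\omega_i^T (x - y)) + \cos(\omega_i^T (x + y) + 2 b_i) \right].
\]

Let $t := x + y$ throughout. Since $\mathbb{E} \cos(\omega^T t + 2b) = \mathbb{E}_\omega \left[ \mathbb{E}_b \cos(\omega^T t + 2b) \right] = 0$, we also have
$\mathbb{E} \, \breve s(x, y) = k(x, y)$.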

Thus, in expectation, both $\tilde z$ and $\breve z$ work; the reconstructions are each the average of bounded, independent
terms with the correct mean. For a given embedding dimension $D$, $\tilde s$ is the average of $D/2$ terms
and $\breve s$ is of $D$, but each component of $\tilde s$ has lower variance; which embedding is superior is,
therefore, not immediately obvious.
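
The two embeddings can be sketched as below for the Gaussian kernel $k(\Delta) = \exp(-\|\Delta\|^2 / (2\sigma^2))$, whose spectral distribution is $\mathcal{N}(0, \sigma^{-2} I)$. The function names `rff_sincos` and `rff_offset` are our own, and the sin/cos features are stacked in two blocks rather than interleaved, which changes only the ordering of the coordinates and not the inner products.

```python
import numpy as np


def rff_sincos(x, omega):
    """Paired sin/cos embedding. omega: (D/2, d) frequencies; returns (n, D)."""
    proj = x @ omega.T
    dim = 2 * omega.shape[0]
    return np.sqrt(2.0 / dim) * np.hstack([np.sin(proj), np.cos(proj)])


def rff_offset(x, omega, b):
    """Random-offset embedding. omega: (D, d); b: (D,) in [0, 2 pi]; returns (n, D)."""
    dim = omega.shape[0]
    return np.sqrt(2.0 / dim) * np.cos(x @ omega.T + b)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, dim, sigma = 3, 4000, 1.0
    x, y = rng.normal(size=(1, d)), rng.normal(size=(1, d))
    k_true = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

    omega_half = rng.normal(scale=1.0 / sigma, size=(dim // 2, d))
    s_tilde = (rff_sincos(x, omega_half) @ rff_sincos(y, omega_half).T).item()

    omega = rng.normal(scale=1.0 / sigma, size=(dim, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=dim)
    s_breve = (rff_offset(x, omega, b) @ rff_offset(y, omega, b).T).item()

    print(k_true, s_tilde, s_breve)
```

Both versions produce $D$-dimensional features; the first draws only $D/2$ frequencies, while the second draws $D$ frequencies and $D$ offsets.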

The academic literature seems split on the issue. In Sutherland and Schneider (2015), we
examined the first 100 papers citing Rahimi and Recht (2007) in a Google Scholar search: 15 used
either $\tilde z$ or the equivalent complex formulation, 14 used $\breve z$, 28 did not specify, and the remainder
merely cited the paper without using the embedding. (None discussed that there was a choice
between the two.) Not included in that count are Rahimi and Recht’s later work (2008a,b), which
used $\breve z$; indeed, post-publication revisions of the original paper discuss only $\breve z$. Practically, the
three implementations of which we are aware each use $\breve z$: scikit-learn (Grisel et al. 2016), Shogun
(Sonnenburg et al. 2010), and JSAT (Raff 2011-16).

We will show that $\tilde z$ is strictly superior for the popular Gaussian kernel, among others. We
will also improve the uniform convergence bounds of Rahimi and Recht (2007).


3.2  Reconstruction  variance 

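We can in fact directly find the covariance of the reconstructions:

\[
\begin{aligned}
\mathrm{Cov}\left( \tilde s(\Delta), \tilde s(\Delta') \right)
&= \frac{2}{D} \mathrm{Cov}\left( \cos(\omega^T \Delta), \cos(\omega^T \Delta') \right) \\
&= \frac{1}{D} \left[ \mathbb{E}\left[ \cos\left( \omega^T (\Delta - \Delta') \right) + \cos\left( \omega^T (\Delta + \Delta') \right) \right]
   - 2 \, \mathbb{E}\left[ \cos(\omega^T \Delta) \right] \mathbb{E}\left[ \cos(\omega^T \Delta') \right] \right] \\
&= \frac{1}{D} \left[ k(\Delta - \Delta') + k(\Delta + \Delta') - 2 k(\Delta) k(\Delta') \right],
\end{aligned} \tag{3.3}
\]

so that

\[
\mathrm{Var} \, \tilde s(\Delta) = \frac{1}{D} \left[ 1 + k(2\Delta) - 2 k(\Delta)^2 \right]. \tag{3.4}
\]

Similarly,

\[
\begin{aligned}
\mathrm{Cov}\left( \breve s(x, y), \breve s(x', y') \right)
&= \frac{1}{D} \mathrm{Cov}\left( \cos(\omega^T \Delta) + \cos(\omega^T t + 2b), \; \cos(\omega^T \Delta') + \cos(\omega^T t' + 2b) \right) \\
&= \frac{1}{D} \Big[ \mathrm{Cov}\left( \cos(\omega^T \Delta), \cos(\omega^T \Delta') \right) + \mathrm{Cov}\left( \cos(\omega^T t + 2b), \cos(\omega^T t' + 2b) \right) \\
&\qquad + \underbrace{\mathrm{Cov}\left( \cos(\omega^T \Delta), \cos(\omega^T t' + 2b) \right)}_{0} + \underbrace{\mathrm{Cov}\left( \cos(\omega^T t + 2b), \cos(\omega^T \Delta') \right)}_{0} \Big] \\
&= \frac{1}{D} \Big[ \tfrac{1}{2} k(\Delta - \Delta') + \tfrac{1}{2} k(\Delta + \Delta') - k(\Delta) k(\Delta') \\
&\qquad + \tfrac{1}{2} \mathbb{E} \cos\left( \omega^T (t + t') + 4b \right) + \tfrac{1}{2} \mathbb{E} \cos\left( \omega^T (t - t') \right) \Big] \\
&= \frac{1}{D} \left[ \tfrac{1}{2} k(\Delta - \Delta') + \tfrac{1}{2} k(\Delta + \Delta') - k(\Delta) k(\Delta') + \tfrac{1}{2} k(t - t') \right],
\end{aligned} \tag{3.5}
\]

and so

\[
\mathrm{Var} \, \breve s(x, y) = \frac{1}{D} \left[ 1 + \tfrac{1}{2} k(2\Delta) - k(\Delta)^2 \right]. \tag{3.6}
\]

Thus $\tilde s$ has lower variance than $\breve s$ when $k(2\Delta) < 2 k(\Delta)^2$.

Definition 3.1. A continuous, shift-invariant positive-definite kernel function $k(x, y) = k(\Delta)$ with
$k(0) = 1$ is pixie when $k(2\Delta) \le 2 k(\Delta)^2$ for all $\Delta$.

Note that the condition always holds when $k(\Delta) \ge \frac{1}{\sqrt{2}}$, since positive-definiteness and $k(0) = 1$
require $k(\cdot) \le 1$. It also trivially holds for $k(2\Delta) \le 0$. Once $k$ breaches $\frac{1}{\sqrt{2}}$ in a particular direction,
it then essentially must decay at least exponentially.¹

Proposition 3.2 (Exponentiated norms). Kernels of the form $k(\Delta) = \exp\left( -\gamma \|\Delta\|^\beta \right)$ for any norm
$\|\cdot\|$ and scalars $\gamma > 0$, $\beta \ge 1$ are pixie.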

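Proof. Following a simple calculation:

\[
\begin{aligned}
2 k(\Delta)^2 - k(2\Delta)
&= 2 \exp\big(-\gamma \|\Delta\|^\beta\big)^2 - \exp\big(-\gamma \|2\Delta\|^\beta\big) \\
&= 2 \exp\big(-2\gamma \|\Delta\|^\beta\big) - \exp\big(-2^\beta \gamma \|\Delta\|^\beta\big) \\
&\ge 2 \exp\big(-2\gamma \|\Delta\|^\beta\big) - \exp\big(-2\gamma \|\Delta\|^\beta\big)
 = \exp\big(-2\gamma \|\Delta\|^\beta\big) > 0. \qquad \square
\end{aligned}
\]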


For example, the Gaussian kernel uses $\|\cdot\|_2$ and $\beta = 2$, and the Laplacian kernel uses $\|\cdot\|_1$
and $\beta = 1$. The variance per dimension of embeddings for the Gaussian kernel is illustrated in
Figure 3.1.
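As a rough numerical illustration of (3.4) and (3.6), the following sketch evaluates both closed-form variances for a one-dimensional Gaussian kernel and provides samplers for a small Monte Carlo comparison. The function names, the restriction to d = 1, the seed, and the number of trials are assumptions made for this example only; it uses just the Python standard library.

```python
import math
import random

def k_gauss(delta, sigma=1.0):
    """Gaussian kernel k(Delta) = exp(-Delta^2 / (2 sigma^2)) for scalar Delta."""
    return math.exp(-delta ** 2 / (2 * sigma ** 2))

def var_tilde(delta, D, sigma=1.0):
    """(3.4): variance of the sin/cos reconstruction with D features."""
    k1, k2 = k_gauss(delta, sigma), k_gauss(2 * delta, sigma)
    return (1 + k2 - 2 * k1 ** 2) / D

def var_breve(delta, D, sigma=1.0):
    """(3.6): variance of the offset-cosine reconstruction with D features."""
    k1, k2 = k_gauss(delta, sigma), k_gauss(2 * delta, sigma)
    return (1 + 0.5 * k2 - k1 ** 2) / D

def sample_tilde(x, y, D, rng, sigma=1.0):
    """One draw of the sin/cos reconstruction: (2/D) * sum over D/2 frequencies."""
    ws = [rng.gauss(0.0, 1.0 / sigma) for _ in range(D // 2)]
    return sum(math.cos(w * (x - y)) for w in ws) * 2 / D

def sample_breve(x, y, D, rng, sigma=1.0):
    """One draw of the offset-cosine reconstruction with D frequencies and offsets."""
    total = 0.0
    for _ in range(D):
        w = rng.gauss(0.0, 1.0 / sigma)
        b = rng.uniform(0.0, 2 * math.pi)
        total += math.cos(w * x + b) * math.cos(w * y + b)
    return total * 2 / D

def empirical_var(sampler, x, y, D, trials=4000, seed=0):
    """Sample variance of a reconstruction over independent feature draws."""
    rng = random.Random(seed)
    vals = [sampler(x, y, D, rng) for _ in range(trials)]
    mean = sum(vals) / trials
    return sum((v - mean) ** 2 for v in vals) / (trials - 1)
```

Calling `empirical_var(sample_tilde, 0.3, 1.3, 20)` and `empirical_var(sample_breve, 0.3, 1.3, 20)` should land close to `var_tilde(1.0, 20)` and `var_breve(1.0, 20)` respectively, with the first smaller than the second, as expected for a pixie kernel.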

Proposition 3.3 (Matérn kernels). Define the Matérn kernel with parameters $\nu > 0$ and $\ell > 0$ as

\[
k_{\nu,\ell}(\Delta) := \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\,\|\Delta\|}{\ell} \right)^{\nu} K_\nu\!\left( \frac{\sqrt{2\nu}\,\|\Delta\|}{\ell} \right),
\]

where $K_\nu$ is a modified Bessel function of the second kind. $k_{\nu,\ell}$ is pixie for all $\nu \ge \tfrac12$.

¹This leads to the obscure name: such functions can have a "decreasing base" to their exponent, which might
remind one of the song "Debaser" by the Pixies.

Figure 3.1: The variance per dimension of $\tilde s$ (blue) and $\breve s$ (orange) for the Gaussian kernel (green).


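Proof. This proof is due to SKB Moore (2016). First, note that it is trivially true for $\Delta = 0$. Then
define $x := \sqrt{2\nu}\,\|\Delta\| / \ell$. We will actually show the stricter inequality $k(2\Delta) \le k(\Delta)^2$, which is
equivalent to saying that for all $x > 0$:

\[
\frac{2^{1-\nu}}{\Gamma(\nu)} (2x)^\nu K_\nu(2x) \le \frac{2^{2-2\nu}}{\Gamma(\nu)^2} x^{2\nu} K_\nu(x)^2,
\]

i.e.

\[
K_\nu(2x) \le \frac{2^{1-2\nu}}{\Gamma(\nu)} x^\nu K_\nu(x)^2.
\]

We will need several identities about Bessel functions. These all hold for any $x > 0$:

\[
K_\nu(x)^2 = \frac12 \int_0^\infty \frac1t \exp\!\left(-\frac t2 - \frac{x^2}{t}\right) K_\nu\!\left(\frac{x^2}{t}\right) \mathrm{d}t
\qquad \text{all } \nu; \text{ (DLMF, (10.32.18) with } z = \xi = x\text{)} \tag{3.7}
\]

\[
K_\nu(x) \ge 2^{\nu-1}\, \Gamma(\nu)\, e^{-x} x^{-\nu}
\qquad \nu \ge \tfrac12; \text{ (Ismail 1990, (1.4))} \tag{3.8}
\]

\[
K_\nu(2x) = \frac12 x^\nu \int_0^\infty \frac{1}{t^{\nu+1}} \exp\!\left(-t - \frac{x^2}{t}\right) \mathrm{d}t
\qquad \text{all } \nu; \text{ (DLMF, (10.32.10) with } z = 2x\text{)} \tag{3.9}
\]

\[
K_{-\nu}(x) = K_\nu(x)
\qquad \text{all } \nu; \text{ (DLMF, (10.27.3))} \tag{3.10}
\]

Note that Ismail (1990) shows (3.8) only for $\nu > \tfrac12$, but it holds for $\nu = \tfrac12$ as well by a trivial
calculation, since $K_{1/2}(x) = \sqrt{\pi / (2x)}\, e^{-x}$ (DLMF, (10.39.2)).
We have:

\[
\begin{aligned}
\frac{2^{1-2\nu}}{\Gamma(\nu)} x^\nu K_\nu(x)^2
&= \frac{2^{1-2\nu}}{\Gamma(\nu)} x^\nu \cdot \frac12 \int_0^\infty \frac1t \exp\!\left(-\frac t2 - \frac{x^2}{t}\right) K_\nu\!\left(\frac{x^2}{t}\right) \mathrm{d}t
&& \text{(3.7)} \\
&\ge \frac{2^{1-2\nu}}{\Gamma(\nu)} x^\nu \cdot \frac12 \int_0^\infty \frac1t \exp\!\left(-\frac t2 - \frac{x^2}{t}\right)
   2^{\nu-1}\, \Gamma(\nu)\, \exp\!\left(-\frac{x^2}{t}\right) \left(\frac{x^2}{t}\right)^{-\nu} \mathrm{d}t
&& \text{(3.8)} \\
&= 2^{-\nu} \cdot \frac12 x^{-\nu} \int_0^\infty t^{\nu-1} \exp\!\left(-\frac t2 - \frac{2x^2}{t}\right) \mathrm{d}t \\
&= \frac12 x^{-\nu} \int_0^\infty \frac{1}{u^{-\nu+1}} \exp\!\left(-u - \frac{x^2}{u}\right) \mathrm{d}u
&& \text{changing variables to } u := t/2 \\
&= K_{-\nu}(2x) = K_\nu(2x).
&& \text{(3.9), (3.10)}
\end{aligned}
\]

Note that (3.8) does not hold for $\nu < \tfrac12$, and in fact the Matérn kernel is not pixie for such $\nu$. □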
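The stronger inequality $k(2\Delta) \le k(\Delta)^2$ used in the proof can be spot-checked numerically without evaluating $K_\nu$, because the Matérn kernel has well-known elementary closed forms at $\nu \in \{1/2, 3/2, 5/2\}$. The sketch below is only a sanity check on a grid, not a proof; the function names, the grid, and the rounding tolerance are assumptions made for this example.

```python
import math

def matern(r, nu, ell=1.0):
    """Matern kernel at distance r >= 0 for nu in {0.5, 1.5, 2.5}.

    Uses the standard half-integer closed forms, with x = sqrt(2 nu) r / ell.
    """
    x = math.sqrt(2 * nu) * r / ell
    if nu == 0.5:
        poly = 1.0
    elif nu == 1.5:
        poly = 1.0 + x
    elif nu == 2.5:
        poly = 1.0 + x + x ** 2 / 3.0
    else:
        raise ValueError("only nu in {0.5, 1.5, 2.5} is implemented")
    return poly * math.exp(-x)

def stronger_gap(r, nu, ell=1.0):
    """k(r)^2 - k(2r); nonnegative when the stronger inequality holds."""
    return matern(r, nu, ell) ** 2 - matern(2 * r, nu, ell)

def holds_on_grid(nu, r_max=10.0, steps=200, tol=1e-12):
    """Check the stronger inequality on an even grid, allowing for rounding error."""
    return all(stronger_gap(r_max * i / steps, nu) >= -tol for i in range(steps + 1))
```

At $\nu = 1/2$ the kernel is $\exp(-\|\Delta\|/\ell)$ and the inequality holds with equality, which is why the check allows a small rounding tolerance.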


3.3  Convergence  bounds 

Let $\tilde f(x, y) := \tilde s(x, y) - k(x, y)$, and $\breve f(x, y) := \breve s(x, y) - k(x, y)$. We know that $\mathbb{E} f(x, y) = 0$ and
have a closed form for $\operatorname{Var} f(x, y)$, but to better understand the error behavior across inputs, we
wish to bound $\|f\|$ for various norms.


3.3.1 $L_2$ bound

If $\mu$ is a finite measure on $\mathcal X \times \mathcal X$ ($\mu(\mathcal X^2) < \infty$), the $L_2(\mathcal X^2, \mu)$ norm of $f$ is

\[
\|f\|_\mu^2 := \int_{\mathcal X^2} f(x, y)^2 \,\mathrm{d}\mu(x, y). \tag{3.11}
\]


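Proposition 3.4. Let $k$ be a continuous shift-invariant positive-definite function $k(x, y) = k(\Delta)$
defined on $\mathcal X \subseteq \mathbb R^d$, with $k(0) = 1$. Let $\mu$ be a finite measure on $\mathcal X^2$, and define $\|\cdot\|_\mu$ as in (3.11).
Define $\tilde z$ as in (3.1) and let $\tilde f(x, y) := \tilde z(x)^\top \tilde z(y) - k(x, y)$. Then
(i) The expected squared $L_2$ norm of the error is

\[
\mathbb{E} \|\tilde f\|_\mu^2 = \frac1D \int_{\mathcal X^2} \big[ 1 + k(2x, 2y) - 2 k(x, y)^2 \big] \,\mathrm{d}\mu(x, y).
\]

(ii) The $L_2$ norm of the error concentrates around its expectation at least exponentially:

\[
\Pr\Big( \big| \|\tilde f\|_\mu^2 - \mathbb{E} \|\tilde f\|_\mu^2 \big| \ge \varepsilon \Big)
\le 2 \exp\!\left( \frac{-D^3 \varepsilon^2}{32 (2D + 1)^2 \mu(\mathcal X^2)^2} \right)
\le 2 \exp\!\left( \frac{-D \varepsilon^2}{288\, \mu(\mathcal X^2)^2} \right).
\]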

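Proposition 3.5. Let $k$, $\mu$, and $\|\cdot\|_\mu$ be as in Proposition 3.4. Define $\breve z$ as in (3.2) and let
$\breve f(x, y) = \breve z(x)^\top \breve z(y) - k(x, y)$. Then
(i) The expected squared $L_2$ norm of the error is

\[
\mathbb{E} \|\breve f\|_\mu^2 = \frac1D \int_{\mathcal X^2} \Big[ 1 + \tfrac12 k(2x, 2y) - k(x, y)^2 \Big] \,\mathrm{d}\mu(x, y).
\]

(ii) The $L_2$ norm of the error concentrates around its expectation at least exponentially:

\[
\Pr\Big( \big| \|\breve f\|_\mu^2 - \mathbb{E} \|\breve f\|_\mu^2 \big| \ge \varepsilon \Big)
\le 2 \exp\!\left( \frac{-D^3 \varepsilon^2}{128 (3D + 2)^2 \mu(\mathcal X^2)^2} \right)
\le 2 \exp\!\left( \frac{-D \varepsilon^2}{3200\, \mu(\mathcal X^2)^2} \right).
\]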


The proofs for these propositions are simple applications of Tonelli's theorem and McDiarmid
bounds; full details are given in Appendices B.1 and B.2.

Thus for the kernels considered above, the expected $L_2(\mu)$ error for $\tilde z$ is less than that of $\breve z$; the
comparable concentration bound is also tighter. The second inequality is simpler, but somewhat
looser for $D \gg 1$; asymptotically, the coefficient of the denominator would be 128 for $\tilde f$ (instead
of 288) and 1152 for $\breve f$ (instead of 3200).

Note that if $\mu = \mu_X \times \mu_Y$ is a joint distribution of independent random variables, then

\[
\begin{aligned}
\mathbb{E} \|\tilde f\|_\mu^2 &= \frac1D \big[ 1 + \mathrm{mmk}_k(\mu_{2X}, \mu_{2Y}) - 2\, \mathrm{mmk}_{k^2}(\mu_X, \mu_Y) \big], \\
\mathbb{E} \|\breve f\|_\mu^2 &= \frac1D \big[ 1 + \tfrac12 \mathrm{mmk}_k(\mu_{2X}, \mu_{2Y}) - \mathrm{mmk}_{k^2}(\mu_X, \mu_Y) \big].
\end{aligned}
\]
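For a concrete sense of scale, the integrals in Propositions 3.4(i) and 3.5(i) can be evaluated numerically. The sketch below does so with a midpoint rule, taking $\mu$ to be the uniform probability distribution on $[-3, 3]^2$ and $k$ the unit-bandwidth Gaussian kernel; this setting, the grid size, and the function names are assumptions chosen for illustration.

```python
import math

def gaussian_k(x, y, sigma=1.0):
    """Gaussian kernel on the real line."""
    return math.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def expected_sq_l2_times_D(b=3.0, sigma=1.0, n=300):
    """Midpoint-rule estimate of D * E||f||_mu^2 for mu uniform on [-b, b]^2.

    Returns a pair: the value for the sin/cos features (Proposition 3.4(i))
    and the value for the offset-cosine features (Proposition 3.5(i)),
    each with the 1/D factor removed.
    """
    h = 2 * b / n
    pts = [-b + (i + 0.5) * h for i in range(n)]
    weight = 1.0 / (n * n)  # uniform probability mass per grid cell
    tilde = 0.0
    breve = 0.0
    for x in pts:
        for y in pts:
            k1 = gaussian_k(x, y, sigma)
            k2 = gaussian_k(2 * x, 2 * y, sigma)
            tilde += (1 + k2 - 2 * k1 ** 2) * weight
            breve += (1 + 0.5 * k2 - k1 ** 2) * weight
    return tilde, breve
```

Under these assumptions the two returned values come out near 0.66 and 0.83, so the expected squared errors are roughly $0.66/D$ for $\tilde f$ and $0.83/D$ for $\breve f$.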

Sriperumbudur and Szabó (2015, Corollary 2 and Theorem 3) subsequently bounded the
deviation of $\tilde f$ in the $L_r$ norm for any $r \in [1, \infty)$, but only for $\mu$ the Lebesgue measure. Let $\ell$
be the diameter of $\mathcal X$ and $C$ be some (unspecified) universal constant. Then their bound for $L_2$
amounts to, for $\varepsilon$ large enough such that the term inside the parentheses is nonnegative,


Pr  (II/IIl2w  >  e)  <  exp 


\2\ 

-C 


This has the same asymptotic rate in terms of $D$ and $\varepsilon$ as our bound but, since $\mu(\mathcal X^2) = O(\ell^{2d})$,
has better dependence on $\ell$.


3.3.2  High-probability  uniform  bound 

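Claim 1 of Rahimi and Recht (2007) is that if $\mathcal X \subset \mathbb R^d$ is compact with diameter $\ell$,²

\[
\Pr\big( \|f\|_\infty \ge \varepsilon \big) \le 256 \left( \frac{\sigma_p \ell}{\varepsilon} \right)^2 \exp\!\left( -\frac{D \varepsilon^2}{8 (d + 2)} \right),
\]

where $\sigma_p^2 = \mathbb{E}_p[\omega^\top \omega] = -\operatorname{tr} \nabla^2 k(0)$ depends on the kernel.

It was not necessarily clear in that paper that the bound applies only to $\tilde z$ and not $\breve z$; we can
also tighten some constants. We first state the tightened bound for $\tilde z$.

²Note that our $D$ is half that in Rahimi and Recht (2007), since we want to compare embeddings of the same dimension.

Proposition 3.6. Let $k$ be a continuous shift-invariant positive-definite function $k(x, y) = k(\Delta)$
defined on $\mathcal X \subset \mathbb R^d$, with $k(0) = 1$ and such that $\nabla^2 k(0)$ exists. Suppose $\mathcal X$ is compact, with
diameter $\ell$. Denote $k$'s Fourier transform as $P(\omega)$, which will be a probability distribution due to
Bochner's theorem; let $\sigma_p^2 = \mathbb{E}_p \|\omega\|^2$. Let $\tilde z$ be as in (3.1), and define $\tilde f(x, y) := \tilde z(x)^\top \tilde z(y) - k(x, y)$.
For any $\varepsilon > 0$, let

\[
\alpha_\varepsilon := \min\!\left( 1,\ \sup_{x, y \in \mathcal X} \tfrac12 + \tfrac12 k(2x, 2y) - k(x, y)^2 + \tfrac16 \varepsilon \right),
\qquad
\beta_d := \left( \left(\tfrac d2\right)^{-\frac{d}{d+2}} + \left(\tfrac d2\right)^{\frac{2}{d+2}} \right) 2^{\frac{6d+2}{d+2}}.
\]

Then

\[
\Pr\big( \|\tilde f\|_\infty \ge \varepsilon \big)
\le \beta_d \left( \frac{\sigma_p \ell}{\varepsilon} \right)^{\frac{2}{1 + 2/d}} \exp\!\left( -\frac{D \varepsilon^2}{8 (d + 2) \alpha_\varepsilon} \right)
\le 66 \left( \frac{\sigma_p \ell}{\varepsilon} \right)^2 \exp\!\left( -\frac{D \varepsilon^2}{8 (d + 2)} \right)
\quad \text{when } \varepsilon \le \sigma_p \ell.
\]

Thus, we can achieve an embedding with pointwise error no more than $\varepsilon$ with probability at least
$1 - \delta$ as long as

\[
D \ge \frac{8 (d + 2) \alpha_\varepsilon}{\varepsilon^2} \left[ \frac{2}{1 + \frac2d} \log \frac{\sigma_p \ell}{\varepsilon} + \log \frac{\beta_d}{\delta} \right].
\]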

The proof strategy is very similar to that of Rahimi and Recht (2007): place an $\varepsilon$-net with
radius $r$ over $\mathcal X_\Delta := \{x - y : x, y \in \mathcal X\}$, bound the error $\tilde f$ by $\varepsilon/2$ at the centers of the net by
Hoeffding's inequality (1963), and bound the Lipschitz constant of $\tilde f$, which is at most that of $\tilde s$,
by $\varepsilon/(2r)$ with Markov's inequality. The introduction of $\alpha_\varepsilon$ is by replacing Hoeffding's inequality
with that of S. Bernstein (1924) when it is tighter, using the variance from (3.4). The constant
$\beta_d$ is obtained by exactly optimizing the value of $r$, rather than the algebraically simpler value
originally used; $\beta_{64} = 66$ is its maximum, and $\lim_{d \to \infty} \beta_d = 64$, though it is lower for small $d$, as
shown in Figure 3.2. The additional hypothesis, that $\nabla^2 k(0)$ exists, is equivalent to the existence
of the first two moments of $P(\omega)$; a finite first moment is used in the proof, and of course without
a finite second moment the bound is vacuous. The full proof is given in Appendix B.3.

For any pixie kernel, $\alpha_\varepsilon \le \tfrac12 + \tfrac16 \varepsilon$; the Bernstein bound is tighter at least when $\varepsilon < 3$. (Recall
that the maximal possible error is $\varepsilon = 2$, so it is essentially always preferable.) For the Gaussian
kernel of bandwidth $\sigma$, $\sigma_p^2 = d / \sigma^2$.

For $\breve z$, since the embedding $\breve s$ is not shift-invariant, we must instead place the $\varepsilon$-net on $\mathcal X^2$.
The additional noise in $\breve s$ also increases the expected Lipschitz constant and gives looser bounds
on each term in the sum, though there are twice as many such terms. The corresponding bound
is as follows:

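Proposition 3.7. Let $k$, $\mathcal X$, $\ell$, $P(\omega)$, and $\sigma_p$ be as in Proposition 3.6. Define $\breve z$ by (3.2), and
$\breve f(x, y) := \breve z(x)^\top \breve z(y) - k(x, y)$. For any $\varepsilon > 0$, define

\[
\alpha'_\varepsilon := \min\!\left( 1,\ \sup_{x, y \in \mathcal X} \tfrac14 + \tfrac18 k(2x, 2y) - \tfrac14 k(x, y)^2 + \tfrac1{12} \varepsilon \right),
\qquad
\beta'_d := \left( d^{-\frac{d}{d+1}} + d^{\frac{1}{d+1}} \right) 2^{\frac{5d+1}{d+1}}\, 3^{\frac{d}{d+1}}.
\]

Then

\[
\Pr\big( \|\breve f\|_\infty \ge \varepsilon \big)
\le \beta'_d \left( \frac{\sigma_p \ell}{\varepsilon} \right)^{\frac{2}{1 + 1/d}} \exp\!\left( -\frac{D \varepsilon^2}{32 (d + 1) \alpha'_\varepsilon} \right)
\le 98 \left( \frac{\sigma_p \ell}{\varepsilon} \right)^2 \exp\!\left( -\frac{D \varepsilon^2}{32 (d + 1)} \right)
\quad \text{when } \varepsilon \le \sigma_p \ell.
\]

Thus, we can achieve an embedding with pointwise error no more than $\varepsilon$ with probability at least
$1 - \delta$ as long as

\[
D \ge \frac{32 (d + 1) \alpha'_\varepsilon}{\varepsilon^2} \left[ \frac{2}{1 + \frac1d} \log \frac{\sigma_p \ell}{\varepsilon} + \log \frac{\beta'_d}{\delta} \right].
\]

$\beta'_{64} = 98$, and $\lim_{d \to \infty} \beta'_d = 96$, also shown in Figure 3.2. The full proof is given in
Appendix B.4.

Figure 3.2: The coefficient $\beta_d$ of Proposition 3.6 (blue, for $\tilde z$) and $\beta'_d$ of Proposition 3.7 (orange,
for $\breve z$). Rahimi and Recht (2007) used a constant of 256 for $\tilde z$.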

For any kernel, pixie or not, the Bernstein bound is superior for any $\varepsilon < 7.5$.

Note that when the Bernstein bound is being used for a typical pixie kernel, $\alpha'_\varepsilon \approx \tfrac12 \alpha_\varepsilon$.
Although we cannot use these bounds to conclude that $\|\tilde f\|_\infty < \|\breve f\|_\infty$, the fact that $\tilde f$ yields
smaller bounds using the same techniques certainly suggests that it might be usually true.
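The constants and sample-size requirements of Propositions 3.6 and 3.7 are easy to tabulate. The following sketch assumes the forms $\beta_d = \big((d/2)^{-d/(d+2)} + (d/2)^{2/(d+2)}\big)\, 2^{(6d+2)/(d+2)}$ and $\beta'_d = \big(d^{-d/(d+1)} + d^{1/(d+1)}\big)\, 2^{(5d+1)/(d+1)}\, 3^{d/(d+1)}$, together with the sufficient condition on $D$ from Proposition 3.6; the function names and default arguments are illustrative rather than part of the original analysis.

```python
import math

def beta(d):
    """Coefficient beta_d of Proposition 3.6 (sin/cos features)."""
    half = d / 2
    return (half ** (-d / (d + 2)) + half ** (2 / (d + 2))) * 2 ** ((6 * d + 2) / (d + 2))

def beta_prime(d):
    """Coefficient beta'_d of Proposition 3.7 (offset-cosine features)."""
    return ((d ** (-d / (d + 1)) + d ** (1 / (d + 1)))
            * 2 ** ((5 * d + 1) / (d + 1)) * 3 ** (d / (d + 1)))

def required_D_tilde(eps, delta, d, sigma_p, ell, alpha=1.0):
    """Smallest even D meeting the sufficient condition of Proposition 3.6.

    alpha = 1 corresponds to the Hoeffding-based bound; passing a smaller
    alpha_eps uses the Bernstein-based refinement instead.
    """
    log_term = ((2 / (1 + 2 / d)) * math.log(sigma_p * ell / eps)
                + math.log(beta(d) / delta))
    D = 8 * (d + 2) * alpha / eps ** 2 * log_term
    return 2 * math.ceil(D / 2)
```

For instance, `beta(64)` evaluates to about 66 and `beta_prime(64)` to about 98, both far below the constant 256, while `beta(1)` is only 12.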


3.3.3  Expected  max  error 

Noting that $\mathbb{E} \|f\|_\infty = \int_0^\infty \Pr(\|f\|_\infty \ge \varepsilon) \,\mathrm{d}\varepsilon$, one could consider bounding $\mathbb{E} \|f\|_\infty$ via Propositions
3.6 and 3.7. Unfortunately, that integral diverges on $(0, \gamma)$ for any $\gamma > 0$. If we instead
integrate the minimum of that bound and 1, the result depends on a solution to a transcendental
equation, so analytical manipulation is difficult.

We can, however, use a slight generalization of Dudley's entropy integral (1967) to obtain the
following bound:

Proposition 3.8. Let $k$, $\mathcal X$, $\ell$, and $P(\omega)$ be as in Proposition 3.6. Define $\tilde z$ by (3.1), and
$\tilde f(x, y) := \tilde z(x)^\top \tilde z(y) - k(x, y)$. Let $\mathcal X_\Delta := \{x - y \mid x, y \in \mathcal X\}$; suppose $k$ is $L$-Lipschitz on $\mathcal X_\Delta$. Let
$R := \mathbb{E} \max_{i = 1, \dots, D/2} \|\omega_i\|$. Then

\[
\mathbb{E} \|\tilde f\|_\infty \le \frac{24 \gamma \sqrt{d}\, \ell}{\sqrt{D}} (R + L),
\]

where $\gamma \approx 0.964$.

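The proof is given in Appendix B.5. In order to apply the method of Dudley (1967), we
must work around $\|\omega_i\|$ (which appears in the covariance of the error process) being potentially
unbounded. To do so, we bound a process with truncated $\|\omega_i\|$, and then relate that bound to $\tilde f$.
For the Gaussian kernel, $L = 1 / (\sigma \sqrt{e})$ and³

\[
R \le \left( \sqrt{2}\, \frac{\Gamma((d+1)/2)}{\Gamma(d/2)} + \sqrt{2 \log(D/2)} \right) \Big/ \sigma
\le \left( \sqrt{d} + \sqrt{2 \log(D/2)} \right) \Big/ \sigma.
\]

Thus

\[
\mathbb{E} \|\tilde f\|_\infty \le \frac{24 \gamma \sqrt{d}\, \ell}{\sigma \sqrt{D}} \left( e^{-1/2} + \sqrt{d} + \sqrt{2 \log(D/2)} \right). \tag{3.12}
\]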
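To get a feel for the magnitude of (3.12), it can be evaluated directly. The sketch below assumes the prefactor $24 \gamma \sqrt{d}\, \ell / (\sigma \sqrt{D})$ obtained by substituting the Gaussian-kernel values of $L$ and $R$ into Proposition 3.8, with $\gamma = 0.964$; the function name and the example setting ($d = 1$, $\ell = 6$, $\sigma = 1$) are choices made for this illustration.

```python
import math

GAMMA_DUDLEY = 0.964  # the constant gamma quoted in Proposition 3.8

def expected_max_error_bound(D, d, ell, sigma):
    """Right-hand side of (3.12) for a Gaussian kernel of bandwidth sigma."""
    prefactor = 24 * GAMMA_DUDLEY * math.sqrt(d) * ell / (sigma * math.sqrt(D))
    bracket = math.exp(-0.5) + math.sqrt(d) + math.sqrt(2 * math.log(D / 2))
    return prefactor * bracket
```

With $d = 1$, $\ell = 6$, $\sigma = 1$, and $D = 500$ the value is roughly 30, far above the maximal possible error of 2, so in this regime the bound is informative about the rate in $D$ rather than about the size of the error.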


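Analogously, for the $\breve z$ features:

Proposition 3.9. Let $k$, $\mathcal X$, $\ell$, and $P(\omega)$ be as in Proposition 3.6. Define $\breve z$ by (3.2), and $\breve f(x, y) :=
\breve z(x)^\top \breve z(y) - k(x, y)$. Suppose $k(\Delta)$ is $L$-Lipschitz. Let $R' := \mathbb{E} \max_{i = 1, \dots, D} \|\omega_i\|$. Then, for $\mathcal X$ and
$D$ not extremely small,

\[
\mathbb{E} \|\breve f\|_\infty \le \frac{48 \gamma'_{\mathcal X}\, \ell \sqrt{d}}{\sqrt{D}} (R' + L),
\]

where $0.803 < \gamma'_{\mathcal X} < 1.542$. See Appendix B.6 for details on $\gamma'_{\mathcal X}$ and the "not extremely small"
assumption.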

The  proof  is  given  in  Appendix  B.6.  It  is  similar  to  that  for  Proposition  3.8,  but  the  lack  of 
shift  invariance  increases  some  constants  and  otherwise  slightly  complicates  matters.  Note  also 
that  the  R'  of  Proposition  3.9  is  slightly  larger  than  the  R  of  Proposition  3.8. 

These  two  bounds  are  both  quite  loose  in  practice. 


3.3.4  Concentration  about  the  mean 

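Bousquet's inequality (2002) can be used to show exponential concentration of $\sup f$ about its
mean.

We consider $\tilde f$ first. Let

\[
\tilde f_\omega(\Delta) := \frac2D \big( \cos(\omega^\top \Delta) - k(\Delta) \big),
\]

so $\tilde f(\Delta) = \sum_{i=1}^{D/2} \tilde f_{\omega_i}(\Delta)$. Define the "wimpy variance" of $\tilde f / 2$ (which we use so that $|\tilde f / 2| \le 1$) as

\[
\begin{aligned}
\sigma^2_{\tilde f / 2}
&:= \sup_{\Delta \in \mathcal X_\Delta} \sum_{i=1}^{D/2} \operatorname{Var}\Big[ \tfrac12 \tilde f_{\omega_i}(\Delta) \Big] \\
&= \frac{1}{2D} \sup_{\Delta \in \mathcal X_\Delta} \operatorname{Var}\big[ \cos(\omega^\top \Delta) \big] \\
&= \frac{1}{4D} \sup_{\Delta \in \mathcal X_\Delta} \big[ 1 + k(2\Delta) - 2 k(\Delta)^2 \big] \\
&=: \frac{1}{4D} \sigma_w^2.
\end{aligned}
\]

Clearly $0 \le \sigma_w^2 \le 2$; for pixie kernels, $\sigma_w^2 \le 1$, with it approaching unity for typical kernels on
domains large relative to the lengthscale.

³By the Gaussian concentration inequality (Boucheron et al. 2013, Theorem 5.6), each $\|\omega\| - \mathbb{E}\|\omega\|$ is
sub-Gaussian with variance factor $\sigma^{-2}$; the claim follows from their Section 2.5.

Proposition 3.10. Let $k$, $\mathcal X$, and $P(\omega)$ be as in Proposition 3.6, and $\tilde z$ be defined by (3.1). Let
$\tilde f(\Delta) = \tilde z(x)^\top \tilde z(y) - k(\Delta)$ for $\Delta = x - y$, and $\sigma_w^2 := \sup_{\Delta \in \mathcal X_\Delta} 1 + k(2\Delta) - 2 k(\Delta)^2$. Then

\[
\Pr\big( \|\tilde f\|_\infty - \mathbb{E} \|\tilde f\|_\infty \ge \varepsilon \big)
\le 2 \exp\!\left( -\frac{D \varepsilon^2}{8 D\, \mathbb{E} \|\tilde f\|_\infty + 2 \sigma_w^2 + \tfrac43 D \varepsilon} \right).
\]

Proof. We use the Bernstein-style form of Theorem 12.5 of Boucheron et al. (2013) on $\tilde f(\Delta)/2$
to obtain that

\[
\Pr\left( \sup \frac{\tilde f}{2} - \mathbb{E} \sup \frac{\tilde f}{2} \ge t \right)
\le \exp\!\left( -\frac{t^2}{2\, \mathbb{E} \sup \tilde f + \frac{1}{2D} \sigma_w^2 + \frac23 t} \right),
\]

\[
\Pr\big( \sup \tilde f - \mathbb{E} \sup \tilde f \ge \varepsilon \big)
\le \exp\!\left( -\frac{\varepsilon^2}{8\, \mathbb{E} \sup \tilde f + \frac{2}{D} \sigma_w^2 + \frac43 \varepsilon} \right)
= \exp\!\left( -\frac{D \varepsilon^2}{8 D\, \mathbb{E} \sup \tilde f + 2 \sigma_w^2 + \tfrac43 D \varepsilon} \right).
\]

The same holds for $-\tilde f$, and $\mathbb{E} \sup \tilde f \le \mathbb{E} \|\tilde f\|_\infty$, $\mathbb{E} \sup(-\tilde f) \le \mathbb{E} \|\tilde f\|_\infty$. The claim follows by a
union bound. □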


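A bound on the lower tail, unfortunately, is not available in the same form.

For $\breve f$, note $|\breve f| \le 3$, so we use $\breve f / 3$. Letting $\breve f_{\omega, b}(x, y) := \frac1D \big( \cos(\omega^\top (x - y)) + \cos(\omega^\top (x + y) +
2b) - k(x, y) \big)$, we have

\[
\begin{aligned}
\sigma^2_{\breve f / 3}
&:= \sup_{x, y \in \mathcal X} \sum_{i=1}^{D} \operatorname{Var}\Big[ \tfrac13 \breve f_{\omega_i, b_i}(x, y) \Big] \\
&= \frac{1}{9D} \sup_{x, y \in \mathcal X} \big[ 1 + \tfrac12 k(2\Delta) - k(\Delta)^2 \big] \\
&= \frac{1}{18 D} \big( 1 + \sigma_w^2 \big).
\end{aligned}
\]

Thus the same argument gives us:

Proposition 3.11. Let $k$, $\mathcal X$, and $P(\omega)$ be as in Proposition 3.6. Let $\breve z$ be as in (3.2), $\breve f(x, y) :=
\breve z(x)^\top \breve z(y) - k(x, y)$, and define $\sigma_w$ as in Proposition 3.10. Then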


Pr 


E||/|U  >  e  <  2  exp 


Ds 2 


12 D  EII/IU  +  ±(1  +  o-l)  + IDs } 


Note  that  the  bound  for  /  strictly  dominates  the  bound  for  /  only  in  the  unlikely  case  of 


(T-  <  J. 


3.3.5  Other  bounds 


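Sriperumbudur and Szabó (2015) later proved a rate-optimal $O_p(D^{-1/2})$ bound on $\|\tilde f\|_\infty$. Phrased
in the terminology we use here, it amounts to:

\[
\Pr\big( \|\tilde f\|_\infty \ge \varepsilon \big) \le
\begin{cases}
1 & \varepsilon \le \dfrac{h}{\sqrt{D/2}} \\[1ex]
\exp\!\Big( -\tfrac12 \big( \sqrt{D/2}\, \varepsilon - h \big)^2 \Big) & \text{otherwise}
\end{cases}
\]

where $h := 32 \sqrt{2d \log(\ell + 1)} + 32 \sqrt{2d \log(\sigma_p + 1)} + 16 \sqrt{\dfrac{2d}{\log(\ell + 1)}}$.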


In practice, for moderately-sized inputs, the constants can be much worse than the non-optimal
bound of Proposition 3.6. For example, the regime of Figure 3.5 is $d = 1$, $\ell = 6$, $\sigma_p = 1$. In
that setting, the smallest $D$ for which even $\Pr(\|\tilde f\|_\infty \ge 1)$ can be shown to be less than unity is a
staggering $D = 27\,392$, compared to the 500 plotted for the other bounds.
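The size of this threshold can be checked by direct computation. The sketch below assumes that $h = 32\sqrt{2d \log(\ell + 1)} + 32\sqrt{2d \log(\sigma_p + 1)} + 16\sqrt{2d / \log(\ell + 1)}$ and that the bound is nontrivial exactly when $\sqrt{D/2}\, \varepsilon > h$; it also assumes $D$ is even, since the sin/cos embedding uses $D/2$ frequencies. The function names are hypothetical.

```python
import math

def h_constant(d, ell, sigma_p):
    """The quantity h of Section 3.3.5, under the form assumed above."""
    log_l = math.log(ell + 1)
    return (32 * math.sqrt(2 * d * log_l)
            + 32 * math.sqrt(2 * d * math.log(sigma_p + 1))
            + 16 * math.sqrt(2 * d / log_l))

def smallest_nontrivial_D(eps, d, ell, sigma_p):
    """Smallest even D with sqrt(D/2) * eps > h, i.e. where the bound drops below 1."""
    h = h_constant(d, ell, sigma_p)
    D = 2 * math.floor((h / eps) ** 2)  # here sqrt(D/2) * eps <= h, so still trivial
    while math.sqrt(D / 2) * eps <= h:
        D += 2
    return D
```

For $d = 1$, $\ell = 6$, $\sigma_p = 1$, and $\varepsilon = 1$ this gives $h \approx 117.03$, so $D$ must exceed $2h^2 \approx 27\,390$.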


3.4  Downstream  error 

When  we  use  random  Fourier  features,  the  final  output  of  our  analysis  is  not  simply  estimates  of 
the  values  of  the  kernel  function;  rather,  we  wish  to  use  this  kernel  within  some  machine  learning 
framework.  A  natural  question,  then,  is:  how  much  does  the  use  of  a  random  Fourier  features 
approximation  change  the  outcome  of  the  prediction  compared  to  if  we  had  used  the  exact  kernel? 

One approach to answering this question is to study the difference between functions in the
original kernel RKHS versus functions in the RKHS corresponding to the approximation. This is
the approach taken by Rahimi and Recht (2008a,b), as well as the later work of Bach (2015) and
Rudi et al. (2016). Rudi et al. (2016), in particular, provide an invaluable theoretical study of the
effect of using random features in regression models.

In some contexts, however, we would prefer to consider not the learning-theoretic convergence
of hypotheses to the assumed "true" function, but rather directly consider the difference in
predictions due to using the $z$ embedding instead of the exact kernel $k$. We give a few such
bounds here. We stress, however, that combining these results with standard learning rates for
the models yields worse bounds compared to those of Bach (2015) and Rudi et al. (2016).

3.4.1 Kernel ridge regression

We first consider kernel ridge regression (KRR; Saunders et al. 1998). Suppose we are given $n$
training pairs $(x_i, y_i) \in \mathbb R^d \times \mathbb R$ as well as a regularization parameter $\lambda = n \lambda_0 > 0$. We construct
the training Gram matrix $K$ by $K_{ij} = k(x_i, x_j)$. KRR gives predictions $h(x) = \alpha^\top k_x$, where
$\alpha = (K + \lambda I)^{-1} y$ and $k_x$ is the vector with $i$th component $k(x_i, x)$.⁴ When using Fourier features,
one would not use $\alpha$, but instead a primal weight vector $w$; still, it will be useful for us to analyze
the situation in the dual.

Proposition 1 of Cortes et al. (2010) bounds the change in KRR predictions from approximating
the kernel matrix $K$ by $\hat K$, in terms of $\|\hat K - K\|_2$. They assume, however, that the kernel evaluations
at test time $k_x$ are unapproximated, which is certainly not the case when using Fourier features.
We therefore extend their result to Proposition 3.12 before using it to analyze the performance of
Fourier features.

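Proposition 3.12. Given a training set $\{(x_i, y_i)\}_{i=1}^n$, with $x_i \in \mathbb R^d$ and $y_i \in \mathbb R$, let $h(x)$ denote the
result of kernel ridge regression using the PSD training kernel matrix $K$ and test kernel values
$k_x$. Let $\hat h(x)$ be the same using a PSD approximation to the training kernel matrix $\hat K$ and test
kernel values $\hat k_x$. Further assume that the training labels are centered, $\sum_{i=1}^n y_i = 0$, and let
$\sigma_y^2 := \frac1n \sum_{i=1}^n y_i^2$. Also suppose $\|k_x\|_\infty \le \kappa$. Then:

\[
|\hat h(x) - h(x)| \le \frac{\sigma_y}{\sqrt{n}\, \lambda_0} \|\hat k_x - k_x\| + \frac{\kappa \sigma_y}{n \lambda_0^2} \|\hat K - K\|_2.
\]

Proof. Let $\alpha = (K + \lambda I)^{-1} y$, $\hat\alpha = (\hat K + \lambda I)^{-1} y$. Thus, using $\hat M^{-1} - M^{-1} = -\hat M^{-1} (\hat M - M) M^{-1}$,
we have

\[
\begin{aligned}
\hat\alpha - \alpha &= -(\hat K + \lambda I)^{-1} (\hat K - K) (K + \lambda I)^{-1} y \\
\|\hat\alpha - \alpha\| &\le \|(\hat K + \lambda I)^{-1}\|_2\, \|\hat K - K\|_2\, \|(K + \lambda I)^{-1}\|_2\, \|y\| \\
&\le \frac{1}{\lambda^2} \|\hat K - K\|_2\, \|y\|
\end{aligned}
\]

since the smallest eigenvalues of $K + \lambda I$ and $\hat K + \lambda I$ are at least $\lambda$. Since $\|k_x\| \le \sqrt{n}\, \kappa$ and
$\|\hat\alpha\| \le \|y\| / \lambda$:

\[
\begin{aligned}
|\hat h(x) - h(x)| &= |\hat\alpha^\top \hat k_x - \alpha^\top k_x| \\
&= |\hat\alpha^\top (\hat k_x - k_x) + (\hat\alpha - \alpha)^\top k_x| \\
&\le \|\hat\alpha\|\, \|\hat k_x - k_x\| + \|\hat\alpha - \alpha\|\, \|k_x\| \\
&\le \frac{\|y\|}{\lambda} \|\hat k_x - k_x\| + \frac{\sqrt{n}\, \kappa\, \|y\|}{\lambda^2} \|\hat K - K\|_2.
\end{aligned}
\]

The claim follows from $\lambda = n \lambda_0$, $\|y\| = \sqrt{n}\, \sigma_y$. □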

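⁴If a bias term is desired, we can use $k'(x, x') = k(x, x') + 1$ by appending a constant feature 1 to the embedding $z$.
Because this change is accounted for exactly, it affects the error analysis here only in that we must use $\sup |k(x, y)| \le 2$,
in which case the first factor of (3.13) becomes $(\lambda_0 + 2) / \lambda_0^2$.

Suppose that, per the uniform error bounds of Section 3.3.2, $\sup |k(x, y) - s(x, y)| \le \varepsilon$. Then
$\|\hat k_x - k_x\| \le \sqrt{n}\, \varepsilon$ and $\|\hat K - K\|_2 \le \|\hat K - K\|_F \le n \varepsilon$, and Proposition 3.12 (with $\kappa = 1$) gives

\[
|\hat h(x) - h(x)| \le \frac{\sigma_y}{\sqrt{n}\, \lambda_0} \sqrt{n}\, \varepsilon + \frac{\sigma_y}{n \lambda_0^2}\, n \varepsilon
= \frac{\lambda_0 + 1}{\lambda_0^2}\, \sigma_y\, \varepsilon. \tag{3.13}
\]

Thus

\[
\Pr\big( |\hat h(x) - h(x)| \ge \varepsilon \big) \le \Pr\!\left( \|f\|_\infty \ge \frac{\lambda_0^2\, \varepsilon}{(\lambda_0 + 1) \sigma_y} \right),
\]

which we can bound with Proposition 3.6 or 3.7. We can therefore guarantee $|h(x) - \hat h(x)| \le \varepsilon$
with probability at least $1 - \delta$ if

\[
D = \Omega\!\left( d \left( \frac{(\lambda_0 + 1) \sigma_y}{\lambda_0^2\, \varepsilon} \right)^{2}
\left[ \log \frac1\delta + \log \frac{(\lambda_0 + 1) \sigma_y}{\lambda_0^2\, \varepsilon} + \log \sigma_p \ell \right] \right).
\]

Note that this rate does not depend on $n$.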
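The relationship between the uniform kernel error and the prediction gap in (3.13) is simple enough to state as code. This is a minimal sketch under the assumption that the bound reads $\frac{\lambda_0 + \kappa}{\lambda_0^2} \sigma_y \varepsilon$, which reduces to (3.13) at $\kappa = 1$ and to the bias-term variant of the footnote at $\kappa = 2$; the function names are illustrative.

```python
def krr_prediction_gap_bound(eps, lam0, sigma_y, kappa=1.0):
    """Bound on |h_hat(x) - h(x)| when the kernel is approximated to within eps.

    lam0 is the per-sample regularization (lambda = n * lam0), sigma_y the
    root-mean-square of the centered labels, kappa the bound on kernel values.
    """
    return (lam0 + kappa) / lam0 ** 2 * sigma_y * eps

def kernel_error_needed(u, lam0, sigma_y, kappa=1.0):
    """Largest uniform kernel error that still guarantees a prediction gap <= u."""
    return lam0 ** 2 * u / ((lam0 + kappa) * sigma_y)
```

Neither function takes $n$ as an argument, reflecting that the requirement on the kernel approximation does not tighten as the training set grows; it does tighten quickly as $\lambda_0$ shrinks.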

If we want $|\hat h(x) - h(x)| = O(1/\sqrt{n})$ in order to match $h(x)$'s convergence rate (Bousquet and
Elisseeff 2001), ignoring the logarithmic terms, we thus need $D = \Omega(n)$, matching the conclusion
of Rahimi and Recht (2008a). It is worth saying again, however, that Bach (2015) and Rudi et al.
(2016) obtained better rates depending on the form of the particular learning problem.


3.4.2  Support  vector  machines 

We will now give a similar bound for SVM classifiers. We will see that this method gives much
worse results than in the ridge regression case; RKHS analyses should be used here instead.

Consider an SVM classifier with no offset, such that $h(x) = w^\top \Phi(x)$ for a kernel embedding
$\Phi(x) : \mathcal X \to \mathcal H$ and $w$ is found by

\[
\operatorname*{argmin}_{w \in \mathcal H}\ \frac12 \|w\|^2 + \frac{C_0}{n} \sum_{i=1}^n \max\big( 0,\ 1 - y_i \langle w, \Phi(x_i) \rangle \big)
\]

where $\{(x_i, y_i)\}_{i=1}^n$ is our training set with $y_i \in \{-1, 1\}$, and the decision function is $h(x) =
\langle w, \Phi(x) \rangle$.⁵ For a given $x$, Cortes et al. (2010) consider an embedding in $\mathcal H = \mathbb R^{n+1}$ which is
equivalent on the given set of points. They bound $|\hat h(x) - h(x)|$ in terms of $\|\hat K - K\|_2$ in their
Proposition 2, but again assume that the test-time kernel values $k_x$ are exact. We will again extend
their result in Proposition 3.13:

Proposition 3.13. Given a training set $\{(x_i, y_i)\}_{i=1}^n$, with $x_i \in \mathbb R^d$ and $y_i \in \{-1, 1\}$, let $h(x)$ denote
the decision function of an SVM classifier using the PSD training matrix $K$ and test kernel values
$k_x$. Let $\hat h(x)$ be the same using a PSD approximation to the training kernel matrix $\hat K$ and test kernel
values $\hat k_x$. Suppose $\sup_x k(x, x) \le \kappa$. Define $\delta_x := \|\hat K - K\|_2 + \|\hat k_x - k_x\| + |\hat k(x, x) - k(x, x)|$.
Then:

\[
|\hat h(x) - h(x)| \le \sqrt{2}\, \kappa^{3/4} C_0\, \delta_x^{1/4} + \sqrt{\kappa}\, C_0\, \delta_x^{1/2}.
\]

⁵We again assume there is no bias term for simplicity; adding a constant feature again changes the analysis only
in that it makes the $\kappa$ of Proposition 3.13 2 instead of 1.

Proof. Use the setup of Section 2.2 of Cortes et al. (2010). In particular, we will use $\|w\| \le \sqrt{\kappa}\, C_0$
and their (16-17):

\[
\Phi(x_i) = K_x^{1/2} e_i,
\qquad
\|\hat w - w\|^2 \le 2 C_0^2 \sqrt{\kappa}\, \big\| \hat K_x^{1/2} - K_x^{1/2} \big\|_2,
\]

where $K_x := \begin{bmatrix} K & k_x \\ k_x^\top & k(x, x) \end{bmatrix}$ and $e_i$ is the $i$th standard basis vector.

Further, Lemma 1 of Cortes et al. (2010) says that $\|\hat K_x^{1/2} - K_x^{1/2}\|_2 \le \|\hat K_x - K_x\|_2^{1/2}$. Let
$f_x := \hat k(x, x) - k(x, x)$; then, by Weyl's inequality for singular values,

\[
\left\| \begin{bmatrix} \hat K - K & \hat k_x - k_x \\ \hat k_x^\top - k_x^\top & f_x \end{bmatrix} \right\|_2
\le \|\hat K - K\|_2 + \|\hat k_x - k_x\| + |f_x|.
\]

Thus

\[
\begin{aligned}
|\hat h(x) - h(x)|
&= \big| (\hat w - w)^\top \hat\Phi(x) + w^\top (\hat\Phi(x) - \Phi(x)) \big| \\
&\le \|\hat w - w\|\, \|\hat\Phi(x)\| + \|w\|\, \|\hat\Phi(x) - \Phi(x)\| \\
&\le \sqrt{2}\, \kappa^{3/4} C_0 \big\| \hat K_x^{1/2} - K_x^{1/2} \big\|_2^{1/2} + \sqrt{\kappa}\, C_0 \big\| (\hat K_x^{1/2} - K_x^{1/2})\, e_{n+1} \big\| \\
&\le \sqrt{2}\, \kappa^{3/4} C_0 \|\hat K_x - K_x\|_2^{1/4} + \sqrt{\kappa}\, C_0 \|\hat K_x - K_x\|_2^{1/2} \\
&\le \sqrt{2}\, \kappa^{3/4} C_0 \big( \|\hat K - K\|_2 + \|\hat k_x - k_x\| + |f_x| \big)^{1/4}
   + \sqrt{\kappa}\, C_0 \big( \|\hat K - K\|_2 + \|\hat k_x - k_x\| + |f_x| \big)^{1/2}
\end{aligned}
\]

as claimed. □

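Suppose that $\sup |k(x, y) - s(x, y)| \le \varepsilon$. Then, as in the last section, $\|\hat k_x - k_x\| \le \sqrt{n}\, \varepsilon$ and
$\|\hat K - K\|_2 \le n \varepsilon$. Then, letting $\gamma$ be 0 for $\tilde z$ and 1 for $\breve z$, Proposition 3.13 gives

\[
|\hat h(x) - h(x)| \le \sqrt{2}\, C_0 \big( n + \sqrt{n} + \gamma \big)^{1/4} \varepsilon^{1/4} + C_0 \big( n + \sqrt{n} + \gamma \big)^{1/2} \varepsilon^{1/2}.
\]

Then $|\hat h(x) - h(x)| \ge u$ only if

\[
\varepsilon \ge \frac{2 C_0^2 + 4 C_0 u + u^2 - 2 (C_0 + u) \sqrt{C_0 (C_0 + 2u)}}{C_0^2 \big( n + \sqrt{n} + \gamma \big)}.
\]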
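The threshold above comes from solving a quadratic in $\big((n + \sqrt{n} + \gamma)\,\varepsilon\big)^{1/4}$, and the algebra can be checked by substituting it back into the bound. The sketch below assumes $\kappa = 1$, as for the kernels considered here, and reads the term under the square root as $C_0 (C_0 + 2u)$; the function names are illustrative.

```python
import math

def svm_gap_bound(eps, n, C0, gamma=0):
    """Bound on |h_hat(x) - h(x)| from Proposition 3.13 with kappa = 1.

    gamma is 0 for the sin/cos features and 1 for the offset-cosine features.
    """
    a = (n + math.sqrt(n) + gamma) * eps
    return math.sqrt(2) * C0 * a ** 0.25 + C0 * a ** 0.5

def eps_threshold(u, n, C0, gamma=0):
    """Kernel error eps at which the bound above equals u."""
    num = (2 * C0 ** 2 + 4 * C0 * u + u ** 2
           - 2 * (C0 + u) * math.sqrt(C0 * (C0 + 2 * u)))
    return num / (C0 ** 2 * (n + math.sqrt(n) + gamma))
```

Evaluating `svm_gap_bound(eps_threshold(u, n, C0), n, C0)` recovers `u` up to rounding, and the threshold shrinks roughly like $1/n$, which is the unfavorable dependence on the training set size.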

This bound has the unfortunate property of requiring the approximation to be more accurate
as the training set size increases, and thus can prove only a very loose upper bound on the
number of features needed to achieve a given approximation accuracy, due to the looseness of
Proposition 3.13. Analyses of generalization error in the induced RKHS, such as Rahimi and Recht
(2008a), T. Yang et al. (2012), and Bach (2015), are more useful in this case.

3.5 Numerical evaluation on an interval

We will conduct a detailed study of the approximations on the interval $\mathcal X = [-b, b]$. Specifically,
we evenly spaced 1 000 points on $[-5, 5]$ and approximated the kernel matrix using both embeddings
at $D \in \{50, 100, 200, \dots, 900, 1\,000, 2\,000, \dots, 9\,000, 10\,000\}$, repeating each trial 1 000
times, estimating $\|f\|_\infty$ and $\|f\|_\mu$ at those points. We do not consider $d > 1$, because obtaining
a reliable estimate of $\sup |f|$ becomes very computationally expensive even for $d = 2$.

Figure 3.3 shows the behavior of $\mathbb{E} \|f\|_\infty$ as $b$ increases for various values of $D$. As expected,
the $\tilde z$ embeddings have almost no error near 0. The error increases out to one or two bandwidths,
after which the curve appears approximately linear in $\ell / \sigma$, as predicted by Propositions 3.8
and 3.9.


Figure 3.3: The maximum error within a given radius in $\mathbb R$, averaged over 1 000 evaluations.
(Curves: $\tilde z$ and $\breve z$ at $D = 50$, 100, 500, and 1 000; horizontal axis: $\ell / (2\sigma)$, from 0 to 5.)

Figure 3.4 fixes $b = 3$ and shows the expected maximal error as a function of $D$. It also plots
the expected error obtained by numerically integrating the bounds of Propositions 3.6 and 3.7
(using the minimum of 1 and the stated bound). We can see that all of the bounds are fairly loose,
but that the first version of the bound in the propositions (with $\beta_d$, the exponent depending on $d$,
and $\alpha_\varepsilon$) is substantially tighter than the second version when $d = 1$.

The bounds on $\mathbb{E} \|f\|_\infty$ of Propositions 3.8 and 3.9 are unfortunately too loose to show on the
same plot. However, one important property does hold. For a fixed $\mathcal X$ and $k$, (3.12) predicts that
$\mathbb{E} \|\tilde f\|_\infty = O(1/\sqrt{D})$. This holds empirically: performing linear regression of $\log \mathbb{E} \|\tilde f\|_\infty$ against
$\log D$ yields a model of $\mathbb{E} \|\tilde f\|_\infty = e^c D^m$, with a 95% confidence interval for $m$ of $[-0.502, -0.496]$;
$\|\breve f\|_\infty$ gives $[-0.503, -0.497]$. The integrated bounds of Propositions 3.6 and 3.7 do not fit the
scaling as a function of $D$ nearly as well.

Figure 3.4: $\mathbb{E} \|f\|_\infty$ for the Gaussian kernel on $[-3, 3]$ with $\sigma = 1$, based on the mean of 1 000
evaluations and on numerical integration of the bounds from Propositions 3.6 and 3.7. ("Tight"
refers to the bound with constants depending on $d$, and "loose" the second version; "old" is the
version from Rahimi and Recht (2007).)
Figure 3.5 shows the empirical survival function of the max error for $D = 500$, along with the
bounds of Propositions 3.6 and 3.7 and those of Propositions 3.10 and 3.11 using the empirical
mean. The latter bounds are tighter than the former for low $\varepsilon$, especially for low $D$, but have a
lower slope.

The mean of the mean squared error, on the other hand, exactly follows the expectation of
Propositions 3.4 and 3.5 using $\mu$ as the uniform distribution on $\mathcal X^2$; in this case, $\mathbb{E} \|\tilde f\|_\mu^2 \approx 0.66/D$,
$\mathbb{E} \|\breve f\|_\mu^2 \approx 0.83/D$. (This is natural, as the expectation is exact.) Convergence to that mean,
however, is substantially faster than guaranteed by the McDiarmid bounds.

Figure 3.5: $\Pr(\|f\|_\infty > \varepsilon)$ for the Gaussian kernel on $[-3, 3]$ with $\sigma = 1$ and $D = 500$, based
on 1 000 evaluations (black), numerical integration of the bounds from Propositions 3.6 and 3.7
(same colors as Figure 3.4), and the bounds of Propositions 3.10 and 3.11 using the empirical
mean (yellow). (Legend: tight, loose, new with empirical mean, and empirical curves for each of
$\tilde z$ and $\breve z$, plus the old bound.)

Chapter 4

Scalable distribution learning with approximate kernel embeddings


We  now  return  to  the  distributional  setting,  developing  embeddings  for  distributions  in  the  style 
of  —  and  employing  —  the  random  Fourier  features  studied  in  Chapter  3. 


4.1  Mean  map  kernels 

Armed with an approximate embedding for shift-invariant kernels on $\mathbb R^d$, we now need only a
simple step to develop our first embedding for a distributional kernel, mmk. Recall that, given
samples $\{X_i\}_{i=1}^n \sim P^n$ and $\{Y_j\}_{j=1}^m \sim Q^m$, $\mathrm{mmk}(P, Q)$ can be estimated as

\[
\mathrm{mmk}(X, Y) = \frac{1}{nm} \sum_{i=1}^n \sum_{j=1}^m k(X_i, Y_j). \tag{4.1}
\]

Simply plugging in an approximate embedding $z(x)^\top z(y) \approx k(x, y)$ yields

\[
\mathrm{mmk}(X, Y) \approx \frac{1}{nm} \sum_{i=1}^n \sum_{j=1}^m z(X_i)^\top z(Y_j)
= \left( \frac1n \sum_{i=1}^n z(X_i) \right)^{\!\top} \left( \frac1m \sum_{j=1}^m z(Y_j) \right)
= \bar z(X)^\top \bar z(Y), \tag{4.2}
\]

where we defined $\bar z : \mathcal S \to \mathbb R^D$ by $\bar z(X) := \frac1n \sum_{i=1}^n z(X_i)$. This additionally has a natural
interpretation as the direct estimate of mmk in the Hilbert space induced by the feature map $z$,
which approximates the Hilbert space associated with $k$.

Thus $\mathrm{mmd}(P, Q) \approx \|\bar z(X) - \bar z(Y)\|$. Since this is simply a Euclidean distance, the generalized
rbf kernel based on that distance, $e^{-\gamma\, \mathrm{mmd}^2}$, can be approximately embedded with $z(\bar z(\cdot))$.
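The construction in (4.2) can be sketched in a few lines. The example below uses the sin/cos random Fourier features for a Gaussian kernel and only the Python standard library; the function names, the list-of-lists data layout, and the choice of kernel are assumptions for illustration, not the implementation used in the experiments.

```python
import math
import random

def make_features(D, d, sigma, rng):
    """Draw D/2 frequencies from the spectral distribution of a Gaussian kernel."""
    return [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(D // 2)]

def z(x, freqs):
    """sin/cos random Fourier features of one point x (a list of d floats)."""
    scale = math.sqrt(2.0 / (2 * len(freqs)))  # sqrt(2 / D) with D = 2 * len(freqs)
    out = []
    for w in freqs:
        proj = sum(wi * xi for wi, xi in zip(w, x))
        out.extend((scale * math.sin(proj), scale * math.cos(proj)))
    return out

def z_bar(X, freqs):
    """Mean embedding of a sample set X: the average of z over its points."""
    feats = [z(x, freqs) for x in X]
    return [sum(col) / len(X) for col in zip(*feats)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mmk_exact(X, Y, sigma):
    """Pairwise estimator (4.1) with the Gaussian kernel."""
    total = 0.0
    for x in X:
        for y in Y:
            sq = sum((a - b) ** 2 for a, b in zip(x, y))
            total += math.exp(-sq / (2 * sigma ** 2))
    return total / (len(X) * len(Y))

def mmd_embedded(X, Y, freqs):
    """Euclidean distance between mean embeddings, approximating the mmd."""
    zx, zy = z_bar(X, freqs), z_bar(Y, freqs)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(zx, zy)))
```

The point of the rearrangement is cost: `dot(z_bar(X, freqs), z_bar(Y, freqs))` needs $O((n + m) D)$ work once the frequencies are fixed, whereas `mmk_exact` needs $O(nm)$ kernel evaluations for each pair of sample sets.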

This natural approximation has been considered many times quite recently (Mehta and Gray
2010; S. Li and Tsang 2011; Zhao and Meng 2014; Chwialkowski et al. 2015; Flaxman, Y.-X.
Wang, et al. 2015; Jitkrittum, Gretton, et al. 2015; Lopez-Paz et al. 2015; Sutherland and
Schneider 2015; Sutherland, J. B. Oliva, et al. 2016).




4.1.1  Convergence  bounds 

We  will  consider  two  approaches  to  proving  bounds  on  this  mmd  embedding. 


Applying  uniform  bounds 


The following trivial bound allows the application of uniform convergence bounds to mmk
estimators. Theorems 3 and 4 of Zhao and Meng (2014) appear to reduce to it.

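Proposition 4.1 (Uniform convergence of $\bar z(X)^\mathsf{T} \bar z(Y)$). Let $z : \mathcal X \to \mathbb R^D$ be a random approximate embedding for a kernel $k$ on some set $\mathcal X$ such that for some $\varepsilon > 0$, $0 < \delta < 1$:

\[
\Pr\Big( \sup_{x, y \in \mathcal X} \big| z(x)^\mathsf{T} z(y) - k(x, y) \big| > \varepsilon \Big) \le \delta. \tag{4.3}
\]

Define $\widehat{\mathrm{mmk}} : \mathcal S \times \mathcal S \to \mathbb R$ as the inner product between the mean maps under kernel $k$ between the empirical distributions of the two inputs, as in (4.1). Let $\bar z : \mathcal S \to \mathbb R^D$ be given by $\bar z(X) := \frac1n \sum_{i=1}^n z(X_i)$. Then

\[
\Pr\Big( \sup_{X, Y \in \mathcal S} \big| \bar z(X)^\mathsf{T} \bar z(Y) - \widehat{\mathrm{mmk}}(X, Y) \big| > \varepsilon \Big) \le \delta.
\]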


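Proof. For any $X, Y \subseteq \mathcal X$, we have

\[
\big| \bar z(X)^\mathsf{T} \bar z(Y) - \widehat{\mathrm{mmk}}(X, Y) \big|
= \Big| \frac{1}{nm} \sum_{i=1}^n \sum_{j=1}^m \big( z(X_i)^\mathsf{T} z(Y_j) - k(X_i, Y_j) \big) \Big|
\le \frac{1}{nm} \sum_{i=1}^n \sum_{j=1}^m \big| z(X_i)^\mathsf{T} z(Y_j) - k(X_i, Y_j) \big|.
\]

Whenever the event in (4.3) does not occur, which happens with probability at least $1 - \delta$, every summand is at most $\varepsilon$, and so this quantity is at most $\varepsilon$ for all $X$, $Y$. □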

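Corollary 4.2 (Uniform convergence of $\|\bar z(X) - \bar z(Y)\|$ to the pairwise estimator). Let $z$, $k$, $\varepsilon$, $\delta$, and $\widehat{\mathrm{mmk}}$ be as in Proposition 4.1. Define $\widehat{\mathrm{mmd}}_b : \mathcal S \times \mathcal S \to \mathbb R$ by $\widehat{\mathrm{mmd}}_b(X, Y)^2 := \widehat{\mathrm{mmk}}(X, X) + \widehat{\mathrm{mmk}}(Y, Y) - 2\,\widehat{\mathrm{mmk}}(X, Y)$. Then

\[
\Pr\Big( \sup_{X, Y \in \mathcal S} \big| \|\bar z(X) - \bar z(Y)\|^2 - \widehat{\mathrm{mmd}}_b(X, Y)^2 \big| > 4\varepsilon \Big) \le \delta.
\]

Proof. With probability at least $1 - \delta$, each of $|\bar z(X)^\mathsf{T} \bar z(X) - \widehat{\mathrm{mmk}}(X, X)|$, $|\bar z(Y)^\mathsf{T} \bar z(Y) - \widehat{\mathrm{mmk}}(Y, Y)|$, and $|\bar z(X)^\mathsf{T} \bar z(Y) - \widehat{\mathrm{mmk}}(X, Y)|$ is at most $\varepsilon$ for all $X$, $Y$ simultaneously by Proposition 4.1. Expanding $\|\bar z(X) - \bar z(Y)\|^2 = \bar z(X)^\mathsf{T} \bar z(X) + \bar z(Y)^\mathsf{T} \bar z(Y) - 2\,\bar z(X)^\mathsf{T} \bar z(Y)$ and applying the triangle inequality then bounds the total error by $4\varepsilon$. □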

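Proposition 4.3 (Convergence of $\|\bar z(X) - \bar z(Y)\|$ to the mmd). Let $z$, $k$, $\delta$, and $\widehat{\mathrm{mmk}}$ be as in Proposition 4.1, writing $\varepsilon_{\bar z}$ for the uniform error level $\varepsilon$ of (4.3), but additionally requiring that $k(x, y) \ge 0$ for all $x, y \in \mathcal X$. Fix a pair of input distributions $P$, $Q$ over $\mathcal X$. Take $X \sim P^n$, $Y \sim Q^m$; then for any $\varepsilon_{\mathrm{MMD}} > 0$ we have

\[
\Pr_{X, Y, z}\Big( \big| \|\bar z(X) - \bar z(Y)\| - \mathrm{mmd}(P, Q) \big| > \frac{2}{\sqrt m} + \frac{2}{\sqrt n} + \varepsilon_{\mathrm{MMD}} + 32\,\varepsilon_{\bar z} \Big)
\le 2 \exp\Big( -\frac{m n\, \varepsilon_{\mathrm{MMD}}^2}{2 (m + n)} \Big) + \delta.
\]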



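Proof. For the sake of brevity, let $\eta_{\bar z} := \|\bar z(X) - \bar z(Y)\|$, $\eta_{XY} := \widehat{\mathrm{mmd}}_b(X, Y)$, $\eta_{PQ} := \mathrm{mmd}(P, Q)$, and $c_{mn} := \frac{2}{\sqrt m} + \frac{2}{\sqrt n}$. Thus we wish to bound

\[
\begin{aligned}
\Pr_{X,Y,z}\big( |\eta_{\bar z} - \eta_{PQ}| > c_{mn} + \varepsilon_{\mathrm{MMD}} + 32\,\varepsilon_{\bar z} \big)
&\le \Pr_{X,Y,z}\big( |\eta_{\bar z} - \eta_{XY}| + |\eta_{XY} - \eta_{PQ}| > c_{mn} + \varepsilon_{\mathrm{MMD}} + 32\,\varepsilon_{\bar z} \big) \\
&\le \Pr_{X,Y,z}\big( |\eta_{\bar z} - \eta_{XY}| > 32\,\varepsilon_{\bar z} \big) + \Pr_{X,Y}\big( |\eta_{XY} - \eta_{PQ}| > c_{mn} + \varepsilon_{\mathrm{MMD}} \big).
\end{aligned}
\]

Theorem 7 of Gretton, Borgwardt, et al. (2012) bounds the latter term:

\[
\Pr_{X,Y}\big( |\eta_{XY} - \eta_{PQ}| > c_{mn} + \varepsilon_{\mathrm{MMD}} \big) \le 2 \exp\Big( -\frac{m n\, \varepsilon_{\mathrm{MMD}}^2}{2 (m + n)} \Big).
\]

For the former, note that $\eta_{\bar z}^2$ and $\eta_{XY}^2$ are each in $[0, 4]$, so

\[
\big| \eta_{\bar z}^2 - \eta_{XY}^2 \big| = |\eta_{\bar z} - \eta_{XY}|\, |\eta_{\bar z} + \eta_{XY}| \le 8\, |\eta_{\bar z} - \eta_{XY}|.
\]

Thus by Corollary 4.2,

\[
\Pr_{X,Y,z}\big( |\eta_{\bar z} - \eta_{XY}| > 32\,\varepsilon_{\bar z} \big)
\le \Pr_{X,Y,z}\big( |\eta_{\bar z}^2 - \eta_{XY}^2| > 4\,\varepsilon_{\bar z} \big)
= \mathbb E_{X,Y} \Pr_z\big( |\eta_{\bar z}^2 - \eta_{XY}^2| > 4\,\varepsilon_{\bar z} \big)
\le \mathbb E_{X,Y}\, \delta = \delta. \qquad □
\]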

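Proposition 4.4 (Convergence of kernel approximation for a given $P$, $Q$). Let $z$, $\bar z$, $k$, $\varepsilon_{\bar z}$, $\delta$, $\widehat{\mathrm{mmk}}$, $P$, $Q$, $X$, $Y$, $n$, and $m$ be as in Proposition 4.3, with the $z$ embedding into dimension $D_1$.

Define a kernel on distributions $K(P, Q) := \exp\big( -\frac{1}{2\sigma^2} \mathrm{mmd}^2(P, Q) \big)$ for some bandwidth $\sigma > 0$.

Let $k_\sigma(x, y) := \exp\big( -\frac{1}{2\sigma^2} \|x - y\|^2 \big)$ be the Gaussian rbf kernel of bandwidth $\sigma$, and $z_\sigma$ its embedding using either $\tilde z$ or $\breve z$ with embedding dimension $D_2$. Estimate the kernel $K(P, Q)$ as $z_\sigma(\bar z(X))^\mathsf{T} z_\sigma(\bar z(Y))$. Then for any $\varepsilon_{\mathrm{MMD}} > 0$, $\varepsilon_{z_\sigma} > 0$: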


X,Y,Z,Z<r 


Pr  I  zM(X))JzAz(Y))-K(P. 


,  1/2  2 
>2)  I  >  — “p  +  ~j= 

(rye  \  s/m  sjn 


32e7 1  +  s 


'Z<T 


<  2  exp  - 


m  n  £7 


2{m  +  n) 


+  6  +  2  exp 


D2£ 


i 


8  +  3  sZo- , 


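Proof. Define $r_\sigma : \mathbb R \to \mathbb R$ by $r_\sigma(x) := \exp(-x^2 / (2\sigma^2))$. Let $\eta_{\bar z} := \|\bar z(X) - \bar z(Y)\|$, $\eta_{PQ} := \mathrm{mmd}(P, Q)$. Then the error in question is

\[
\big| r_\sigma(\eta_{PQ}) - z_\sigma(\bar z(X))^\mathsf{T} z_\sigma(\bar z(Y)) \big|
\le \big| r_\sigma(\eta_{PQ}) - r_\sigma(\eta_{\bar z}) \big| + \big| r_\sigma(\eta_{\bar z}) - z_\sigma(\bar z(X))^\mathsf{T} z_\sigma(\bar z(Y)) \big|.
\]

The first term, because $r_\sigma$ is $\frac{1}{\sigma\sqrt e}$-Lipschitz, is at most $\frac{1}{\sigma\sqrt e} |\eta_{PQ} - \eta_{\bar z}|$. Using Proposition 4.3:

\[
\Pr_{X,Y,z}\Big( |\eta_{PQ} - \eta_{\bar z}| > \frac{2}{\sqrt m} + \frac{2}{\sqrt n} + \varepsilon_{\mathrm{MMD}} + 32\,\varepsilon_{\bar z} \Big)
\le 2 \exp\Big( -\frac{m n\, \varepsilon_{\mathrm{MMD}}^2}{2 (m + n)} \Big) + \delta.
\]

The latter term is just the error of the $z_\sigma$ embedding on the inputs $\bar z(X)$, $\bar z(Y)$. We can use the Bernstein bound of (B.3) and (B.6), simplifying it a bit because $\mathrm{Var}[\cos(\omega^\mathsf{T} \Delta)] \le \frac12$ for rbf kernels: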


Pr 

X,Y,Z,Z<r 


(\nAm)  -  zAz(Y))\  > 

=  EXj,=  Pr  (\rAm)  ~  z<x(2(20)T^(2(y))| 

Z(T  ' 


>  £ 


Z(T 


<  E x,y  =  2  exp 


D2s 


Zcr 


8+3  SZcr  , 


=  2  exp 


D2s 


Zcr 


8  +  3  Szo- , 


□ 


It is worth re-emphasizing two points: first, that the $\delta$ of these bounds will depend on the diameter of $\mathcal X$, and so they are not directly applicable to distributions on unbounded domains. Secondly, extension of Proposition 4.4 to a bound uniform over input distributions would require a uniform version of Proposition 4.3, presumably based on a uniform extension of Theorem 7 of Gretton, Borgwardt, et al. (2012). This could be done e.g. by bounding the Lipschitz constant of the error of the mmd estimator over some smoothness class of distributions, as in the proof of Proposition 3.6.


For  fixed  inputs 

We  can  also  show  bounds  more  directly  for  a  fixed  pair  of  inputs  (fixed  sample  sets  X,  Y  at  first; 
later,  for  fixed  distributions  P,  Q).  This  approach  will  allow  us  to  consider  unbounded  domains, 
but  does  not  allow  for  direct  uniform  results  as  in  Proposition  4.1  and  Corollary  4.2. 

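Proposition 4.5 (Convergence of $\bar z(X)^\mathsf{T} \bar z(Y)$ for fixed $X$, $Y$). Let $z : \mathcal X \to \mathbb R^D$ be either $\tilde z$ of (3.1) or $\breve z$ of (3.2), corresponding to a continuous, shift-invariant, positive definite kernel function $k(x, y) = k(x - y)$ with $k(0) = 1$. Let $\bar z(X) := \frac1n \sum_{i=1}^n z(X_i)$. Then, considering $X \subseteq \mathcal X$ of size $n$ and $Y \subseteq \mathcal X$ of size $m$ fixed:

(i) The variance of the mmk embedding is:

\[
\mathrm{Var}\big[ \bar z(X)^\mathsf{T} \bar z(Y) \big] = \frac{1}{n^2 m^2} \sum_{i,j} \sum_{i',j'} \mathrm{Cov}\big( z(X_i)^\mathsf{T} z(Y_j),\, z(X_{i'})^\mathsf{T} z(Y_{j'}) \big),
\]

which for $\tilde z$ is

\[
\tilde V_{X,Y} := \frac1D \tilde v_{X,Y} := \frac{1}{D n^2 m^2} \sum_{i,j} \sum_{i',j'} \Big[ k(X_i - X_{i'} - Y_j + Y_{j'}) + k(X_i + X_{i'} - Y_j - Y_{j'}) - 2\, k(X_i - Y_j)\, k(X_{i'} - Y_{j'}) \Big] \tag{4.4}
\]

and for $\breve z$ is

\[
\breve V_{X,Y} := \frac1D \breve v_{X,Y} := \frac{1}{D n^2 m^2} \sum_{i,j} \sum_{i',j'} \Big[ \tfrac12 k(X_i - X_{i'} - Y_j + Y_{j'}) + \tfrac12 k(X_i + X_{i'} - Y_j - Y_{j'}) - k(X_i - Y_j)\, k(X_{i'} - Y_{j'}) + \tfrac12 k(X_i - X_{i'} + Y_j - Y_{j'}) \Big]. \tag{4.5}
\]

Note that

\[
\breve v_{X,Y} = \tfrac12 \tilde v_{X,Y} + \frac{1}{2 n^2 m^2} \sum_{i,j} \sum_{i',j'} k(X_i - X_{i'} + Y_j - Y_{j'}).
\]

(ii) Let $\tilde\alpha^{(\varepsilon)}_{X,Y} := \min\big( 4,\ 2 \tilde v_{X,Y} + \tfrac43 \varepsilon \big)$, using the variance factor $\tilde v_{X,Y}$ of (4.4). Similarly use (4.5) to define $\breve\alpha^{(\varepsilon)}_{X,Y} := \min\big( 8,\ 2 \breve v_{X,Y} + \tfrac43 \varepsilon \big)$. Then, letting $\alpha^{(\varepsilon)}_{X,Y}$ denote $\tilde\alpha^{(\varepsilon)}_{X,Y}$ for the $\tilde z$ embedding and $\breve\alpha^{(\varepsilon)}_{X,Y}$ for the $\breve z$ embedding:

\[
\Pr\Big( \big| \bar z(X)^\mathsf{T} \bar z(Y) - \widehat{\mathrm{mmk}}(X, Y) \big| > \varepsilon \Big) \le 2 \exp\Big( -\frac{D \varepsilon^2}{\alpha^{(\varepsilon)}_{X,Y}} \Big).
\]

(iii) Let $\alpha^{(\infty)}$ be 4 for the $\tilde z$ embedding and 8 for $\breve z$. Then

\[
\mathbb E\, \big| \bar z(X)^\mathsf{T} \bar z(Y) - \widehat{\mathrm{mmk}}(X, Y) \big| \le \sqrt{\frac{\pi\, \alpha^{(\infty)}}{D}}.
\]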


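Proof.

(i) Simply expand $\bar z(X)^\mathsf{T} \bar z(Y)$ into a sum, as in (4.2), and use (3.3) and (3.5).

(ii) For $\tilde z$, we can think of $\bar z(X)^\mathsf{T} \bar z(Y)$ as an average of $D/2$ terms like

\[
\frac{1}{nm} \sum_{i=1}^n \sum_{j=1}^m \cos\big( \omega^\mathsf{T} (X_i - Y_j) \big),
\]

each of which has mean $\widehat{\mathrm{mmk}}(X, Y)$, variance $\tfrac12 \tilde v_{X,Y}$, and is bounded by $[-1, 1]$. The claim gives the better of Hoeffding's and Bernstein's inequalities; the latter is tighter when $\varepsilon < 3 - \tfrac32 \tilde v_{X,Y}$.

Similarly, $\breve z$ gives an average of $D$ terms like

\[
\frac{1}{nm} \sum_{i=1}^n \sum_{j=1}^m \Big[ \cos\big( \omega^\mathsf{T} (X_i - Y_j) \big) + \cos\big( \omega^\mathsf{T} (X_i + Y_j) + 2b \big) \Big],
\]

each of which has mean $\widehat{\mathrm{mmk}}(X, Y)$, variance $\breve v_{X,Y}$, and is bounded by $[-2, 2]$. Here Bernstein's is tighter for $\varepsilon < 6 - \tfrac32 \breve v_{X,Y}$.

(iii) Integrate the Hoeffding-form bound of (ii), using $\mathbb E |X| = \int_0^\infty \Pr(|X| > \varepsilon)\, \mathrm d\varepsilon$. □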

Note that Proposition 4.5(i) gives that the variance in terms of $D$ is exactly $\frac1D v_{X,Y}$ (with $v_{X,Y}$ depending only on $k$, $X$, and $Y$), whereas Proposition 4.1 does not allow for an easy form for the variance when used with Propositions 3.6 and 3.7.
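The $\frac1D \tilde v_{X,Y}$ form of (4.4) can be checked numerically. The sketch below assumes a unit-bandwidth Gaussian kernel and the paired sine/cosine features, for which $\bar z(X)^\mathsf{T} \bar z(Y)$ is the average of $\cos(\omega^\mathsf{T}(X_i - Y_j))$ over all pairs and frequencies; the sample sizes, the number of frequencies, and the number of repetitions are arbitrary illustrative choices.

```python
import numpy as np

def k(delta, sigma=1.0):
    """Gaussian kernel as a function of the difference vector (last axis)."""
    return np.exp(-(delta ** 2).sum(axis=-1) / (2 * sigma ** 2))

rng = np.random.default_rng(1)
n, m, d, n_freqs = 5, 4, 2, 50
D = 2 * n_freqs                               # embedding dimension of the sin/cos features
X = rng.normal(0.0, 1.0, size=(n, d))
Y = rng.normal(0.5, 1.0, size=(m, d))

# Delta holds every difference X_i - Y_j as a row, shape (n*m, d).
Delta = (X[:, None, :] - Y[None, :, :]).reshape(-1, d)
k_Delta = k(Delta)

# Variance factor of (4.4): average over all pairs of pairs ((i, j), (i', j')).
v_tilde = (
    k(Delta[:, None, :] - Delta[None, :, :])
    + k(Delta[:, None, :] + Delta[None, :, :])
    - 2 * np.outer(k_Delta, k_Delta)
).mean()
predicted_var = v_tilde / D

def embedded_mmk(omegas):
    """zbar(X)^T zbar(Y) for the sin/cos features with the given frequencies."""
    return np.cos(Delta @ omegas.T).mean()

# Redraw the random frequencies many times and measure the spread of the estimate.
draws = np.array([embedded_mmk(rng.normal(size=(n_freqs, d))) for _ in range(4000)])
empirical_var = draws.var()

print(predicted_var, empirical_var)
```

With `X` and `Y` held fixed, the only randomness is in the frequencies, which is exactly the setting of Proposition 4.5.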

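We can easily extend this to a convergence bound on the mmd embedding. The variance is also available via the same technique as Proposition 4.5(i), and is still $O(1/D)$.

Corollary 4.6 (Convergence of $\|\bar z(X) - \bar z(Y)\|^2$ for fixed $X$, $Y$). Let $z$, $\bar z$, $k$, $X$, $Y$, $m$, $n$, and $\alpha^{(\infty)}$ be as in Proposition 4.5. Define $\widehat{\mathrm{mmd}}_b$ as in Corollary 4.2. Then:

\[
\Pr\Big( \big| \|\bar z(X) - \bar z(Y)\|^2 - \widehat{\mathrm{mmd}}_b(X, Y)^2 \big| > \varepsilon \Big) \le 6 \exp\Big( -\frac{D \varepsilon^2}{16\, \alpha^{(\infty)}} \Big).
\]

Proof. We can upper-bound $\big| \|\bar z(X) - \bar z(Y)\|^2 - \widehat{\mathrm{mmd}}_b(X, Y)^2 \big|$ by

\[
\big| \bar z(X)^\mathsf{T} \bar z(X) - \widehat{\mathrm{mmk}}(X, X) \big| + \big| \bar z(Y)^\mathsf{T} \bar z(Y) - \widehat{\mathrm{mmk}}(Y, Y) \big| + 2 \big| \bar z(X)^\mathsf{T} \bar z(Y) - \widehat{\mathrm{mmk}}(X, Y) \big|.
\]

Use the Hoeffding version of Proposition 4.5(ii) with $\varepsilon / 4$ for each term, then a union bound over the three terms. □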


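We can also allow $X \sim P^n$, $Y \sim Q^m$ to be random:

Corollary 4.7 (Variance of $\bar z(X)^\mathsf{T} \bar z(Y)$ for random $X$, $Y$). Let $\tilde z$, $\breve z$, $k$ be as in Proposition 4.5. Let mmk denote the inner product between mean embeddings with the kernel $k$. Fix distributions $P$, $Q$ over $\mathcal X$. Letting $X, X' \sim P$, denote the distribution of $X - X'$ as $\Delta_P$ and $X + X'$ as $T_P$. Similarly define $\Delta_Q$ and $T_Q$. Then the expected variance of the embedding-based mmk estimator for $\tilde z$ is:

\[
\tilde V_{P,Q} := \mathbb E_{X \sim P^n,\, Y \sim Q^m} \mathrm{Var}_z\big[ \bar z(X)^\mathsf{T} \bar z(Y) \big]
= \frac1D \Big[ \mathrm{mmk}(\Delta_P, \Delta_Q) + \mathrm{mmk}(T_P, T_Q) - 2\, \mathrm{mmk}(P, Q)^2 \Big]
\]

and for $\breve z$ is:

\[
\breve V_{P,Q} := \mathbb E_{X \sim P^n,\, Y \sim Q^m} \mathrm{Var}_z\big[ \bar z(X)^\mathsf{T} \bar z(Y) \big]
= \frac1D \Big[ \mathrm{mmk}(\Delta_P, \Delta_Q) + \tfrac12 \mathrm{mmk}(T_P, T_Q) - \mathrm{mmk}(P, Q)^2 \Big].
\]

Note that $V_{P,Q}$ is not the "full" variance of the estimator, which is

\[
\mathrm{Var}_{X,Y,z}\big[ \bar z(X)^\mathsf{T} \bar z(Y) \big] = V_{P,Q} + \mathrm{Var}_{X,Y}\, \widehat{\mathrm{mmk}}(X, Y).
\]

Proof. For the values of $V_{P,Q}$, take expectations of Proposition 4.5(i). The final statement is just the law of total variance, noting that $\mathbb E_z\big[ \bar z(X)^\mathsf{T} \bar z(Y) \big] = \widehat{\mathrm{mmk}}(X, Y)$. □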

Corollary 4.8 (Convergence of $\|\bar z(X) - \bar z(Y)\|$ for random $X$, $Y$). Let $\tilde z$, $\breve z$, $k$ be as in Proposition 4.5, but additionally require that $k(x, y) \ge 0$ for all $x$, $y$. Let mmd denote the maximum mean discrepancy with kernel $k$. Fix distributions $P$, $Q$ over $\mathcal X$, and let $X \sim P^n$ and $Y \sim Q^m$. Let $\alpha^{(\infty)}$ be 4 for $\tilde z$ and 8 for $\breve z$. Then for any $\varepsilon_{\mathrm{MMD}}, \varepsilon_{\bar z} > 0$,


Pr 

X,Y,Z 


||z(V)  -  z(F)||  -  mmd(P,  Q)\ 


2  2 

>  - +  —  + 

xfm  xfn 


^MMD 


<  2  exp 


\ 

8  (m  +  n) ) 


+  6  exp 


_E±_\ 

1024 a(°°) ) 


Proof.  The  argument  is  as  for  Proposition  4.3,  replacing  Corollary  4.2  with  Corollary  4.6.  □ 
Corollary 4.9 (Convergence of kernel approximation for a given $P$, $Q$). Let $\tilde z$, $\breve z$, $k$, $X$, $Y$, and $\alpha^{(\infty)}$ be as in Corollary 4.8, with $z$ having embedding dimension $D_1$. Define a kernel $K(P, Q) := \exp\big( -\frac{1}{2\sigma^2} \mathrm{mmd}^2(P, Q) \big)$ for some bandwidth $\sigma > 0$. Let $z_\sigma$ be the embedding for the Gaussian rbf kernel of bandwidth $\sigma$, using either $\tilde z$ or $\breve z$ of embedding dimension $D_2$; define the estimator of $K(P, Q)$ as $\hat K(X, Y) := z_\sigma(\bar z(X))^\mathsf{T} z_\sigma(\bar z(Y))$. Then for any $\varepsilon_{\mathrm{MMD}}, \varepsilon_{\bar z}, \varepsilon_{z_\sigma} > 0$:


Pr 

X,Y,z 


r(z(X))JzM(Y))-  K{P,Q)\ 


> 


I . 


+ 


crsfe  \  yjm  sjn 


+ 


+  £=  I  +  £r 


<  2  exp 


inns: 


32  (m  +  n) 


+  6  exp 


gig 

1024a(°°) 


+  2  exp 


d**L  I 

8  +  §£Z(r) 


Proof  As  for  Proposition  4.4,  using  Corollary  4.8  rather  than  Proposition  4.3. 


□ 


Converting Corollary 4.9 to a bound uniform over distributions would have similar challenges to those of Proposition 4.4, except that the $\varepsilon_{\bar z}$ term would similarly need to be treated over a smoothness class of distributions, whereas Proposition 4.4 gets that "for free" via Corollary 4.2.


4.2  L2  distances 

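J. B. Oliva, Neiswanger, et al. (2014) gave an embedding for $e^{-\gamma L_2^2}$, by first embedding $L_2$ with orthonormal projections and then applying random Fourier features.

Suppose that $\mathcal X \subseteq [0, 1]^d$. Let $\{\varphi_\alpha\}_{\alpha \in \mathbb Z^d}$ be an orthonormal basis for $L_2([0, 1]^d)$, perhaps constructed as the $d$-fold tensor product of an orthonormal basis for $L_2([0, 1])$. Then any function $f \in L_2([0, 1]^d)$ can be represented as $f(x) = \sum_{\alpha \in \mathbb Z^d} a_\alpha(f)\, \varphi_\alpha(x)$, where

\[
a_\alpha(f) := \langle \varphi_\alpha, f \rangle = \int_{[0,1]^d} \varphi_\alpha(t)\, f(t)\, \mathrm dt,
\]

and for any $f, g \in L_2([0, 1]^d)$,

\[
\langle f, g \rangle
= \Big\langle \sum_{\alpha \in \mathbb Z^d} a_\alpha(f)\, \varphi_\alpha,\ \sum_{\beta \in \mathbb Z^d} a_\beta(g)\, \varphi_\beta \Big\rangle
= \sum_{\alpha \in \mathbb Z^d} \sum_{\beta \in \mathbb Z^d} a_\alpha(f)\, a_\beta(g)\, \langle \varphi_\alpha, \varphi_\beta \rangle
= \sum_{\alpha \in \mathbb Z^d} a_\alpha(f)\, a_\alpha(g).
\]

Let $V \subset \mathbb Z^d$ be an appropriately chosen finite set of indices $\{\alpha_1, \dots, \alpha_{|V|}\}$. Define $\vec a(f) = \big( a_{\alpha_1}(f), \dots, a_{\alpha_{|V|}}(f) \big)^\mathsf{T} \in \mathbb R^{|V|}$. If $f$ and $g$ are smooth with respect to $V$, i.e. they have only small contributions from basis functions not in $V$, we have

\[
\langle f, g \rangle = \sum_{\alpha \in \mathbb Z^d} a_\alpha(f)\, a_\alpha(g) \approx \sum_{\alpha \in V} a_\alpha(f)\, a_\alpha(g) = \vec a(f)^\mathsf{T} \vec a(g).
\]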
aeZd  a£V 
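Now, given a sample $X = \{X_1, \dots, X_n\} \sim P^n$, let $\hat P(x) = \frac1n \sum_{i=1}^n \delta(X_i - x)$ be the empirical distribution of $X$. J. B. Oliva, Neiswanger, et al. (2014) estimate the density $p$ as

\[
\hat p(x) = \sum_{\alpha \in V} a_\alpha(\hat P)\, \varphi_\alpha(x)
\quad\text{where}\quad
a_\alpha(\hat P) = \int \varphi_\alpha(t)\, \mathrm d\hat P(t) = \frac1n \sum_{i=1}^n \varphi_\alpha(X_i). \tag{4.6}
\]

Note that technically this is an extension of $a_\alpha$ to a broader domain than $L_2([0, 1]^d)$. Assuming that the distribution functions are smooth with respect to $V$, i.e. they lie in the Sobolev ellipsoid corresponding to the basis functions of $V$, we thus have that

\[
\langle p, q \rangle \approx \langle \hat p, \hat q \rangle \approx \vec a(\hat P)^\mathsf{T} \vec a(\hat Q)
\]

and so

\[
z\big( \vec a(\hat P) \big)^\mathsf{T} z\big( \vec a(\hat Q) \big) \approx \exp\Big( -\frac{1}{2\sigma^2} \|p - q\|_2^2 \Big).
\]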

For the Sobolev assumption to hold on a fairly general class of distributions, however, we need $|V|$ to be $\Omega(T^d)$ for some constant $T$. Since the embedding is of dimension $|V|$, this method is limited in practice to fairly low dimensions $d$.
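A minimal one-dimensional sketch of the projection step (4.6) follows. It assumes the cosine basis $\varphi_0 = 1$, $\varphi_k(x) = \sqrt2 \cos(\pi k x)$ on $[0, 1]$ and ten basis functions; the Beta-distributed samples and the helper names `cosine_basis` and `projection_coeffs` are illustrative choices rather than anything prescribed by the text.

```python
import numpy as np

def cosine_basis(x, n_basis):
    """Orthonormal cosine basis on [0, 1]: phi_0 = 1, phi_k = sqrt(2) cos(pi k x)."""
    ks = np.arange(n_basis)
    phi = np.sqrt(2.0) * np.cos(np.pi * ks[None, :] * x[:, None])
    phi[:, 0] = 1.0
    return phi                                    # shape (len(x), n_basis)

def projection_coeffs(sample, n_basis):
    """a_alpha(P_hat) = (1/n) sum_i phi_alpha(X_i), as in (4.6)."""
    return cosine_basis(sample, n_basis).mean(axis=0)

rng = np.random.default_rng(0)
X = rng.beta(2, 5, size=5000)                     # sample from p = Beta(2, 5)
Y = rng.beta(3, 3, size=5000)                     # sample from q = Beta(3, 3)

a_P = projection_coeffs(X, n_basis=10)
a_Q = projection_coeffs(Y, n_basis=10)
ip_embed = a_P @ a_Q                              # approximates the inner product of p and q

# Reference value by quadrature on a fine midpoint grid, using the known densities.
grid = (np.arange(100_000) + 0.5) / 100_000
p = 30 * grid * (1 - grid) ** 4
q = 30 * grid ** 2 * (1 - grid) ** 2
ip_true = np.mean(p * q)

print(ip_embed, ip_true)
```

In $d$ dimensions the same construction uses tensor products of the one-dimensional basis functions, which is where the $\Omega(T^d)$ growth of $|V|$ comes from.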

J.  B.  Oliva,  Neiswanger,  et  al.  (2014)  proved  learning  theoretic  bounds  on  the  use  of  this 
estimator  with  ridge  regression.  Because  the  L2  embedding  is  deterministic,  the  convergence 
portion  of  the  bound  is  not  especially  interesting:  the  Sobolev  assumption  on  the  densities  is 
essentially  that  the  embedding  error  is  bounded  by  a  certain  amount. 


4.2.1  Connection  to  mmd  embedding 


The components of the embedding (4.6) are of the form

\[
a_\alpha(X) = \frac1n \sum_{i=1}^n \varphi_\alpha(X_i),
\]

whereas the embedding $\bar z$ of Section 4.1 has components of the form

\[
\bar z_j(X) = \frac1n \sum_{i=1}^n z_j(X_i).
\]

This similarity in form is tantalizing, but how similar are the $z_j$ and $\varphi_\alpha$ functions?

Taking a more general view of the mmd embedding than solely one based on random Fourier features, the $L_2$ embedding can be viewed as proportional to a mean map embedding in the Hilbert space defined by the basis functions $\{\varphi_\alpha\}_{\alpha \in V}$, with a kernel given by $k(x, y) = \sum_{\alpha \in V} \varphi_\alpha(x)\, \varphi_\alpha(y)$. As $V$ expands to $\mathbb Z^d$, this space converges to $L_2([0, 1]^d)$, with a shift-invariant kernel of the Dirac delta function.

In practice, we often use the tensor product of the cosine, Fourier, or trigonometric bases for $L_2([0, 1])$. However, the following orthonormal basis¹ for $L_2([0, 1]^d)$ more closely resembles a mean map embedding with the $\tilde z$ random Fourier features:

\[
\varphi_0(x) = 1, \qquad
\varphi_k(x) = \sqrt2 \cos(2\pi k^\mathsf{T} x),\ k \in \mathcal K_d, \qquad
\varphi'_k(x) = \sqrt2 \sin(2\pi k^\mathsf{T} x),\ k \in \mathcal K_d,
\]

where $\mathcal K_d$ is the set of $d$-vectors with integral entries with at least one nonzero coordinate, the first of which is positive: $\mathcal K_1 = \{1, 2, \dots\}$, $\mathcal K_d = \big( \{1, 2, \dots\} \times \mathbb Z^{d-1} \big) \cup \big( \{0\} \times \mathcal K_{d-1} \big)$. This restriction is needed for orthogonality because $\varphi_k = \varphi_{-k}$, and $\varphi'_k = -\varphi'_{-k}$. We can obtain an almost exactly equivalent $L_2$ embedding, however, by using $\mathcal K'_d = \mathbb Z^d \setminus \{0\}$: inner products are then effectively doubled, except for the constant term 1. Consider the index set $V_T = \{0\} \cup \{k \in \mathcal K'_d : \max_j |k_j| \le T\}$. Now, note that for kernels whose Fourier transforms are discrete distributions, sampling without replacement in the $\tilde z$ embedding still works: tighter versions of many of the same bounds even hold, replacing the Hoeffding or Bernstein bounds with their Serfling-style analogues (Serfling 1974; Bardenet and Maillard 2015). Thus the $\tilde z$ embedding for a kernel corresponding to the Fourier transform of a uniform distribution over $[-T, T]^d$ has the exact same arguments to the sine and cosine terms, except for adding a useless constant 0 dimension. This $\tilde z$ embedding is of dimension $D = 2(2T + 1)^d$, and is scaled by a constant factor relative to the $L_2$ embedding. The kernel being embedded is the tensor product of a normalized Dirichlet kernel on each dimension, namely

\[
k(\Delta) = \prod_{j=1}^d \frac{1}{2T + 1} \Big( 1 + 2 \sum_{k=1}^T \cos(2\pi k \Delta_j) \Big)
= \prod_{j=1}^d \frac{\sin\big( (2T + 1)\, \pi \Delta_j \big)}{(2T + 1) \sin(\pi \Delta_j)}.
\]

¹This is not in standard use, but we can see that its span is dense in $L_2$ via the Stone-Weierstrass theorem.


The Dirichlet kernel is well-known in the theory of Fourier transforms, and is an approximation to the Dirac $\delta$ function. Note also that Corollary 4(ii) of Sriperumbudur, Gretton, et al. (2010) shows that as $T \to \infty$, the mmd based on $k$ converges to the appropriate rescaling constant times the $L_2$ distance, independently confirming that the $L_2$ embedding asymptotically works.
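The two forms of the normalized Dirichlet kernel, and its reading as the kernel whose spectral distribution is uniform over the integer frequencies $\{-T, \dots, T\}$, can be checked numerically in one dimension. This is only a sanity-check sketch; the function names and the choice $T = 5$ are illustrative.

```python
import numpy as np

def dirichlet_sum(delta, T):
    """(1 + 2 sum_{k=1}^T cos(2 pi k delta)) / (2T + 1)."""
    ks = np.arange(1, T + 1)
    cos_terms = np.cos(2 * np.pi * ks[None, :] * delta[:, None])
    return (1 + 2 * cos_terms.sum(axis=1)) / (2 * T + 1)

def dirichlet_closed(delta, T):
    """sin((2T + 1) pi delta) / ((2T + 1) sin(pi delta)); needs sin(pi delta) != 0."""
    return np.sin((2 * T + 1) * np.pi * delta) / ((2 * T + 1) * np.sin(np.pi * delta))

def uniform_frequency_kernel(delta, T):
    """E[cos(2 pi k delta)] for k uniform on the integers {-T, ..., T}."""
    ks = np.arange(-T, T + 1)
    return np.cos(2 * np.pi * ks[None, :] * delta[:, None]).mean(axis=1)

T = 5
delta = np.linspace(0.01, 0.99, 99)   # avoid delta = 0 and delta = 1, where sin(pi delta) = 0

k_sum = dirichlet_sum(delta, T)
k_closed = dirichlet_closed(delta, T)
k_uniform = uniform_frequency_kernel(delta, T)
print(np.max(np.abs(k_sum - k_closed)), np.max(np.abs(k_sum - k_uniform)))
```

The $d$-dimensional kernel in the text is the product of this quantity over coordinates.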


4.3  Information-theoretic  distances 

We will now show how to extend this general approach to a class of information theoretic distances that includes tv, js, and squared Hellinger (Sutherland, J. B. Oliva, et al. 2016). We consider a class of metrics that we term homogeneous density distances (hdds):

\[
\rho^2(p, q) = \int_{[0,1]^d} \kappa\big( p(x), q(x) \big)\, \mathrm dx \tag{4.7}
\]

where $\kappa : \mathbb R_+ \times \mathbb R_+ \to \mathbb R_+$ is a 1-homogeneous negative-definite function². That is, $\kappa(tx, ty) = t\, \kappa(x, y)$ for all $t > 0$, and there exists some Hilbert space where $\|x - y\|^2 = \kappa(x, y)$. This class was studied by Fuglede (2005); Table 4.1 shows some important instances.

Our embedding will take three steps:

Embedding hdds into $L_2$  We define a random function $\psi$ such that $\rho(p, q) \approx \|\psi(p) - \psi(q)\|$, where $\psi(p)$ is a function from $[0, 1]^d$ to $\mathbb R^{2M}$. Thus the metric space of densities with distance $\rho$ is approximately embedded into the metric space of $2M$-dimensional $L_2$ functions.

Finite Embeddings of $L_2$  We use the approach of Section 4.2 to approximately embed smooth $L_2$ functions into finite vectors in $\mathbb R^{|V|}$. Combined with the previous step, we obtain features $A(p) \in \mathbb R^{2M|V|}$ such that $\rho$ is approximated by Euclidean distances between the $A(\cdot)$ features.

Embedding rbf Kernels into $\mathbb R^D$  We use random Fourier features $z(\cdot)$ so that inner products between $z(A(\cdot))$ features, in $\mathbb R^D$, approximate $K(p, q)$.

²Sometimes referred to as a negative-definite kernel.

Name                      $\kappa(p(x), q(x))$                                                                                   $\mathrm d\mu(\lambda)$
Jensen-Shannon (js)       $\frac{p(x)}{2} \log\frac{2 p(x)}{p(x) + q(x)} + \frac{q(x)}{2} \log\frac{2 q(x)}{p(x) + q(x)}$        $\frac{\mathrm d\lambda}{\cosh(\pi\lambda)\,(1 + \lambda^2)}$
Squared Hellinger (h²)    $\frac12 \big( \sqrt{p(x)} - \sqrt{q(x)} \big)^2$                                                      $\frac12\, \delta(\lambda = 0)\, \mathrm d\lambda$
Total Variation (tv)      $\frac12\, |p(x) - q(x)|$                                                                              $\frac{2}{\pi} \frac{\mathrm d\lambda}{1 + 4\lambda^2}$

Table 4.1: Various squared hdds; $\mu$ will be defined shortly.

Vedaldi  and  Zisserman  (2012)  studied  embeddings  of  a  similar  class  of  kernels,  but  only  for 
discrete  distributions  (e.g.  histograms).  Their  approach  was  basically  analogous  to  ours,  but  uses 
a  fixed  sampling  scheme  rather  than  the  random  one  we  employ  to  approximate  k,  and  the  L2 
embedding  step  is  trivial  in  their  setting  since  they  operate  componentwise.  We  compare  to  their 
approaches  in  Section  5.2,  a  case  in  which  the  histogram  assumption  harms  the  convergence  of 
the  estimator  significantly  with  low  sample  sizes,  but  allows  for  faster  computation. 

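Our embedding proceeds as follows. Fuglede (2005) shows that $\kappa$ corresponds to a unique bounded measure $\mu(\lambda)$, shown in Table 4.1, by

\[
\kappa(x, y) = \int_{\mathbb R_{\ge 0}} \big| x^{\frac12 + i\lambda} - y^{\frac12 + i\lambda} \big|^2\, \mathrm d\mu(\lambda).
\]

The following is equivalent, but makes it easier to find $\mu$:

\[
\kappa(x, 1/x) = Z x + \frac{Z}{x} - 2 \int_{\mathbb R_{\ge 0}} \cos(2\lambda \log x)\, \mathrm d\mu(\lambda). \tag{4.8}
\]

Let $Z := \mu(\mathbb R_{\ge 0})$ so that $\mu / Z$ is a distribution; also define $c_\lambda := (-\frac12 + i\lambda) / (\frac12 + i\lambda)$. Then $\kappa(x, y) = \mathbb E_{\lambda \sim \mu/Z} |g_\lambda(x) - g_\lambda(y)|^2$ where $g_\lambda(x) := \sqrt Z\, c_\lambda \big( x^{\frac12 + i\lambda} - 1 \big)$.

We approximate the expectation with an empirical mean. Let $\lambda_j \overset{\text{iid}}{\sim} \mu / Z$ for $j \in \{1, \dots, M\}$; then

\[
\kappa(x, y) \approx \frac1M \sum_{j=1}^M \big| g_{\lambda_j}(x) - g_{\lambda_j}(y) \big|^2.
\]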
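As a concrete check of this random approximation, the sketch below uses the total variation row of Table 4.1, where $\mathrm d\mu(\lambda) = \frac{2}{\pi} \frac{\mathrm d\lambda}{1 + 4\lambda^2}$ on $\lambda \ge 0$. Under that measure $Z = \frac12$ and $\mu / Z$ is a half-Cauchy distribution with scale $\frac12$, so sampling is straightforward. The values of `x`, `y`, and `M` are arbitrary illustrative choices.

```python
import numpy as np

Z = 0.5   # total mass of mu for tv: integral of (2/pi) / (1 + 4 lam^2) over lam >= 0

def g(lam, x):
    """g_lambda(x) = sqrt(Z) * c_lambda * (x^(1/2 + i lam) - 1)."""
    c = (-0.5 + 1j * lam) / (0.5 + 1j * lam)
    return np.sqrt(Z) * c * (np.power(x, 0.5 + 1j * lam) - 1.0)

rng = np.random.default_rng(0)
M = 200_000
# mu / Z has density (4/pi) / (1 + 4 lam^2) on lam >= 0: a half-Cauchy with scale 1/2.
lam = 0.5 * np.abs(rng.standard_cauchy(M))

x, y = 0.3, 1.2
kappa_mc = np.mean(np.abs(g(lam, x) - g(lam, y)) ** 2)   # empirical mean over lambda_j
kappa_true = 0.5 * abs(x - y)                            # tv integrand from Table 4.1

print(kappa_mc, kappa_true)
```

Because $|c_\lambda| = 1$, each summand is bounded, so the Monte Carlo error shrinks at the usual $M^{-1/2}$ rate; in the embedding itself $M$ is kept small and the same $\lambda_j$ are shared across all input distributions.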

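Hence, the squared hdd is, letting $\Re$, $\Im$ denote the real and imaginary parts:

\[
\begin{aligned}
\rho^2(p, q)
&= \int_{[0,1]^d} \kappa\big( p(x), q(x) \big)\, \mathrm dx \\
&= \int_{[0,1]^d} \mathbb E_{\lambda \sim \mu/Z} \big| g_\lambda(p(x)) - g_\lambda(q(x)) \big|^2\, \mathrm dx \\
&\approx \frac1M \sum_{j=1}^M \int_{[0,1]^d} \Big[ \big( \Re(g_{\lambda_j}(p(x))) - \Re(g_{\lambda_j}(q(x))) \big)^2 + \big( \Im(g_{\lambda_j}(p(x))) - \Im(g_{\lambda_j}(q(x))) \big)^2 \Big]\, \mathrm dx \\
&= \frac1M \sum_{j=1}^M \Big( \big\| p^R_{\lambda_j} - q^R_{\lambda_j} \big\|^2 + \big\| p^I_{\lambda_j} - q^I_{\lambda_j} \big\|^2 \Big)
\end{aligned} \tag{4.9}
\]

where

\[
p^R_\lambda(x) := \Re\big( g_\lambda(p(x)) \big), \qquad p^I_\lambda(x) := \Im\big( g_\lambda(p(x)) \big).
\]

Each $p_\lambda$ function is in $L_2([0, 1]^d)$, so we can approximate the Gaussian rbf kernel based on $\rho$, $\exp(-\gamma \rho^2(p, q))$, as in Section 4.2: let

\[
A(p) := \frac{1}{\sqrt M} \Big( \vec a(p^R_{\lambda_1})^\mathsf{T},\ \vec a(p^I_{\lambda_1})^\mathsf{T},\ \dots,\ \vec a(p^R_{\lambda_M})^\mathsf{T},\ \vec a(p^I_{\lambda_M})^\mathsf{T} \Big)^\mathsf{T}
\]

so that the kernel is estimated by $z(A(p))^\mathsf{T} z(A(q))$.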

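However, the projection coefficients of the $p_\lambda$ functions do not have simple forms as before; instead, we must directly estimate the density as $\hat p$ using a technique such as kernel density estimation (kde) and then estimate $\vec a(\hat p_\lambda)$ for each $\lambda$ with numerical integration. Recall that the elements of $A(\hat p)$ are of the form

\[
a_\alpha\big( \hat p^S_{\lambda_j} \big) = \int_{[0,1]^d} \varphi_\alpha(t)\, \hat p^S_{\lambda_j}(t)\, \mathrm dt
\]

where $j \in \{1, \dots, M\}$, $S \in \{R, I\}$, $\alpha \in V$. For small $d$, simple Monte Carlo integration is sufficient. Choosing $\{u_i\}_{i=1}^{n_e} \overset{\text{iid}}{\sim} \mathrm{Unif}([0, 1]^d)$:

\[
\hat a_\alpha\big( \hat p^S_{\lambda_j} \big) = \frac{1}{n_e} \sum_{i=1}^{n_e} \varphi_\alpha(u_i)\, \hat p^S_{\lambda_j}(u_i), \tag{4.10}
\]

giving us an estimate of $A(\hat p)$ which we call $\hat A(\hat p)$.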
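The Monte Carlo step (4.10) is sketched below in one dimension. The integrand `f` is only a stand-in for a $\hat p^S_{\lambda_j}$ function (in the full method it would be the real or imaginary part of $g_{\lambda_j}$ applied to a kernel density estimate), and the cosine basis and sample sizes are assumed for illustration.

```python
import numpy as np

def phi(alpha, t):
    """Orthonormal cosine basis on [0, 1] (an assumed choice of basis)."""
    return np.ones_like(t) if alpha == 0 else np.sqrt(2.0) * np.cos(np.pi * alpha * t)

def mc_projection(f, alpha, u):
    """Monte Carlo estimate of the integral of phi_alpha(t) f(t) over [0, 1], as in (4.10)."""
    return np.mean(phi(alpha, u) * f(u))

def f(t):
    """Stand-in integrand: the Beta(2, 5) density, playing the role of p_lambda^S."""
    return 30.0 * t * (1.0 - t) ** 4

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 1.0, size=50_000)          # the n_e integration points u_i
grid = (np.arange(200_000) + 0.5) / 200_000     # fine midpoint grid used as a reference

est = np.array([mc_projection(f, a, u) for a in range(4)])
ref = np.array([np.mean(phi(a, grid) * f(grid)) for a in range(4)])
print(est)
print(ref)
```

The same points `u` can be reused for every $\alpha$, $j$, and $S$, so the density estimate only needs to be evaluated $n_e$ times per input distribution.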

In higher dimensions, three problems arise: (i) density estimation becomes statistically difficult, (ii) accurate numerical integration becomes expensive, and (iii) the embedding dimension increases exponentially. We can attempt to address (i) with sparse nonparametric graphical models (Lafferty et al. 2012) or other high-dimensional density estimation techniques (Sriperumbudur, Fukumizu, Kumar, et al. 2013). Point (ii) could be handled with mcmc integration; high-dimensional multimodal integrals remain particularly challenging to current mcmc techniques, but some progress is being made (e.g. Betancourt 2015; Fan et al. give a heuristic algorithm). Challenge (iii) requires some changes to the algorithm to address, as it does for Section 4.2.

Summary and Complexity  The algorithm for computing random features $\{z(\hat A(\hat p_i))\}_{i=1}^N$ for the generalized rbf kernel based on an hdd $\rho$ among a set of distributions, given sample sets $\{X_i\}_{i=1}^N$ where $X_i = \{x^{(i)}_j \in [0, 1]^d\}_{j=1}^{n_i} \overset{\text{iid}}{\sim} p_i$, is thus:

1. Draw $M$ scalars $\lambda_j \overset{\text{iid}}{\sim} \mu / Z$ and $D/2$ vectors $\omega_r \overset{\text{iid}}{\sim} \mathcal N(0, \sigma^{-2} I_{2M|V|})$, in $O(M |V| D)$ time.

2. For each of the $N$ input distributions $i$:

(a) Compute a kde from $X_i$, $\hat p_i(u_j)$ for each $u_j$ in (4.10), in $O(n_i n_e)$ time.

(b) Compute $\hat A(\hat p_i)$ using a numerical integration estimate as in (4.10), in $O(M |V| n_e)$ time.

(c) Get the random Fourier features, $z(\hat A(\hat p_i))$, in $O(M |V| D)$ time.

Supposing each $n_i \asymp n$, this process takes a total of $O(N n n_e + N M |V| n_e + N M |V| D)$ time. Taking $|V|$ to be asymptotically $O(n)$, $n_e = O(D)$, and $M = O(1)$ for simplicity, this is $O(N n D)$ time, compared to about $O(N^2 n \log n + N^3)$ for using the $k$-nn estimator for divergences with corrections for indefiniteness, or $O(N^2 n^2)$ for using the quadratic-time mmd estimator (as in Muandet, Schölkopf, et al. 2012).


4.3.1  Convergence  bound 

We bound the finite-sample error of our estimator for fixed densities $p$ and $q$ by considering each source of error: kernel density estimation ($\varepsilon_{\mathrm{KDE}}$); approximating $\mu(\lambda)$ with $M$ samples ($\varepsilon_\lambda$); truncating the tails of the projection coefficients ($\varepsilon_{\mathrm{tail}}$); Monte Carlo integration ($\varepsilon_{\mathrm{int}}$); and the rks embedding ($\varepsilon_{\mathrm{RKS}}$).

Proposition 4.10. Fix $p$ and $q$ as two densities supported on $[0, 1]^d$ satisfying some smoothness assumptions: that they are members of a periodic Hölder class $\Sigma_{\mathrm{per}}(\beta, L_\beta)$ for some $\beta, L_\beta > 0$, that they are bounded below by $\rho_*$ and above by $\rho^*$, and that their kernel density estimates are in $\Sigma_{\mathrm{per}}(\hat\gamma, \hat L)$ for some $\hat\gamma, \hat L > 0$ with probability at least $1 - \delta$. Suppose we observe $n$ samples from each.

We will use the estimator of Section 4.3 with a suitable form of kernel density estimation to obtain a uniform error bound with a rate based on a function $C^{-1}$ (Giné and Guillou 2002). We use the Fourier basis and choose $V = \{\alpha \in \mathbb Z^d \mid \sum_{j=1}^d |\alpha_j|^{2s} \le t\}$ for parameters $0 < s < \hat\gamma$, $t > 0$.

Then,  for  any  sRKS  +  — (skde  +  Fi  +  Fail  +  Fnt)  <  £•' 


Pr  (| K(p,q)  -  z(A(p))Tz(A(p))|  >  e)  <  2  exp  +  2  exp  (-Me^/(8Z2))  +  <5 

,2/77(2 p+d)  \ 


+  2  C" 


s4  n~ 
t,KDEn 


4  log  77 


+  2M  (l  -  p([0,  Utail))) 


( 


%M  |V|  exp 


/ 


-\ne 


\ 


fl  +  4,7(8 1 V|Z)-  if 


Vp*  + 1 


where  utaii  :=  (0,  ^^4 ;7  -  ?)• 

For  a  more  detailed  statement  and  the  proof,  see  Appendix  C.l. 

The bound decreases when the function is smoother (larger $\beta$, $\hat\gamma$; smaller $L$) or lower-dimensional ($d$), or when we observe more samples ($n$). Using more projection coefficients (higher $t$ or smaller $s$, giving higher $|V|$) improves the approximation but makes numerical integration more difficult. Likewise, taking more samples from $\mu$ (higher $M$) improves that approximation, but increases the number of functions to be approximated and numerically integrated.

4.3.2  Generalization to α-hdds

Corollary 1 of Fuglede (2005), the core of our previous embedding, actually applies to a broader class of functions. Let an $\alpha$-hdd be an hdd whose $\kappa$ is $\alpha$-homogeneous, in the sense that $\kappa(tx, ty) = t^\alpha \kappa(x, y)$. Thus the hdds discussed previously are 1-hdds. The embedding is just as before, except that

\[
g^{(\alpha)}_\lambda(x) := \sqrt Z\, \frac{-\frac\alpha2 + i\lambda}{\frac\alpha2 + i\lambda} \big( x^{\frac\alpha2 + i\lambda} - 1 \big), \tag{4.11}
\]

and so of course the $p_\lambda$ functions are altered accordingly as well. The equivalent of (4.8) is

\[
\kappa(x, 1/x) = Z x^\alpha + Z x^{-\alpha} - 2 \int_{\mathbb R_{\ge 0}} \cos(2\lambda \log x)\, \mathrm d\mu(\lambda). \tag{4.12}
\]

For example, $L_2$ is a 2-hdd defined by $\kappa(x, y) = (x - y)^2$; of course, $\kappa$ is negative-definite. Note that, using (4.12), $\kappa(x, 1/x) = (x - 1/x)^2 = x^2 + x^{-2} - 2$ so that $\mu(\lambda) = \delta(\lambda = 0)$, and

\[
g^{(2)}_0(x) = 1 - x,
\]

so (using $M = 1$) the embedding (4.9) becomes simply

\[
\rho^2(p, q) = \|(1 - p) - (1 - q)\|^2 + \|0 - 0\|^2 = \|p - q\|^2.
\]

Proposition 4.10 could be extended to $\alpha$-hdds without too much difficulty.


4.3.3  Connection  to  mmd 

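2-hdds are defined by, combining (4.7) and (4.11):

\[
\begin{aligned}
\rho^2(p, q)
&= \int_{\mathcal X} \int_{\mathbb R_{\ge 0}} \big| p(x)^{1 + i\lambda} - q(x)^{1 + i\lambda} \big|^2\, \mathrm d\mu(\lambda)\, \mathrm dx \\
&= \int_{\mathbb R_{\ge 0}} \int_{\mathcal X} \big| p(x)^{1 + i\lambda} - q(x)^{1 + i\lambda} \big|^2\, \mathrm dx\, \mathrm d\mu(\lambda) \\
&= \int_{\mathbb R_{\ge 0}} \int_{\mathcal X} \big| p(x)\, e^{i\lambda \log p(x)} - q(x)\, e^{i\lambda \log q(x)} \big|^2\, \mathrm dx\, \mathrm d\mu(\lambda).
\end{aligned}
\]

Meanwhile, Corollary 4(i) of Sriperumbudur, Gretton, et al. (2010) establishes that when $k$ is a continuous shift-invariant kernel on $\mathcal X \subseteq \mathbb R^d$ and $\Omega$ the Fourier transform of $k$:

\[
\begin{aligned}
\mathrm{mmd}(P, Q)^2
&= \int_{\mathbb R^d} \Big| \mathbb E_{X \sim P}\big[ e^{i\omega^\mathsf{T} X} \big] - \mathbb E_{Y \sim Q}\big[ e^{i\omega^\mathsf{T} Y} \big] \Big|^2\, \mathrm d\Omega(\omega) \\
&= \int_{\mathbb R^d} \Big| \int_{\mathcal X} p(x)\, e^{i\omega^\mathsf{T} x}\, \mathrm dx - \int_{\mathcal X} q(x)\, e^{i\omega^\mathsf{T} x}\, \mathrm dx \Big|^2\, \mathrm d\Omega(\omega).
\end{aligned}
\]

This similarity in form is appealing, but a deeper connection between the two is elusive.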
Chapter 5

Applications  of  distribution  learning 


We  now  turn  to  case  studies  in  applying  distributional  kernels  to  real  machine  learning  tasks: 

•  Section 5.1 employs distribution regression to predict the total mass of galaxy clusters in observationally realistic settings. (Results previously published in Ntampaka, Trac, Sutherland, Battaglia, et al. 2015; Ntampaka, Trac, Sutherland, Fromenteau, et al. in press.)

•  Section 5.2 examines the scalability of distribution embeddings on a synthetic problem of predicting the number of components in a Gaussian mixture (Sutherland, J. B. Oliva, et al. 2016).

•  Section  5.3  studies  scene  recognition  in  natural  images.  Section  5.3.1  uses  full-Gram 
matrix  techniques  with  sift  features  (Poczos,  Xiong,  Sutherland,  et  al.  20  li ;  Sutherland, 
Xiong,  et  al.  2012);  Section  5.3.2  uses  distribution  embeddings  with  deep  learning-derived 
features  (Sutherland,  J.  B.  Oliva,  et  al.  2016). 

•  Section 5.4 applies distribution regression to the photons observed by a small backpack-sized sensor to identify potentially harmful sources of radiation (Jin et al. 2016).


5.1  Dark  matter  halo  mass  prediction 

Galaxy clusters are the most massive gravitationally bound systems in the universe, containing up to hundreds of galaxies embedded in dark matter halos. Their properties, especially total mass, are extremely useful for making inferences about fundamental cosmological parameters, but because they are composed largely of dark matter, measuring that mass is difficult.

One  classical  method  is  that  of  Zwicky  (193).  The  virial  theorem  implies  that  the  dispersion 
of  velocities  in  a  stable  system  should  be  approximately  related  to  the  halo  mass  as  a  power  law;  by 
measuring  the  Doppler  shift  of  spectra  from  objects  in  the  cluster,  we  can  estimate  the  dispersion 
of  velocities  in  the  direction  along  our  line  of  sight,  and  thus  predict  the  total  mass.  Zwicky’s 
estimate  famously  led  him  to  the  first  formal  inference  about  the  presence  of  dark  matter. 

Experimental evidence, however, points towards various complicating factors that disturb this idealized relationship, and indeed results based on numerical simulation have shown that the predictions from this power law relationship are not as accurate as we would hope. We can therefore consider using all information available in the line-of-sight velocity distribution by directly learning a regression function from that distribution to total masses, based on data from simulation.

(a) Power law results.
(b) skl results on $|v_{\mathrm{los}}|$ features.

Figure 5.1: Performance for halo mass prediction, for power law (left) and distribution regression (right) approaches. Each test projection is plotted with its true log mass on the horizontal axis and prediction on the vertical axis. The black line shows perfect predictions; the yellow line gives the median of the predicted points, the darker red region shows 68% scatter, and the lighter red 95% scatter.

We  assembled  a  catalog  of  massive  halos  from  the  MultiDark  mdpl  simulation  (Klypin  et  al. 
201z).  The  catalog  contains  5  028  unique  halos.  Since  we  use  only  line-of-sight  velocities, 
however,  we  can  view  each  halo  from  multiple  directions.  For  hyperparameter  selection  and 
testing,  we  use  lines  of  sight  corresponding  to  three  perpendicular  directions;  for  training,  we 
additionally  use  projections  sampled  randomly  from  the  unit  sphere  so  as  to  oversample  the  rare 
high-mass  halos.  Different  projections  of  the  same  halo  are  always  assigned  to  the  same  fold  for 
cross-validation.  Ntampaka,  Trac,  Sutherland,  Battaglia,  et  al.  (2015)  give  a  precise  description 
of  the  details. 

We then use the skl estimator of Q. Wang et al. (2009) in a generalized rbf kernel on a simple
one-dimensional feature set containing only the magnitude of the line-of-sight velocity. Figure 5.1
shows results, establishing that the distribution regression technique greatly outperforms the power
law. The power law achieves a root mean squared error (rmse) of 0.180, whereas the distribution
learning method achieves 0.118. Ntampaka, Trac, Sutherland, Battaglia, et al. (2015) also considered
other featurizations, which performed similarly or sometimes slightly better, and give a much more
thorough analysis of the results.
(a) Power law results.

(b) skl results on |v_los| features.

Figure 5.2: Performance for halo mass prediction with interlopers. Same format as Figure 5.1.


These  results,  however,  differed  from  the  true  observational  setting  in  one  important  way:  we 
assumed  perfect  knowledge  of  cluster  memberships.  In  actual  observations,  we  would  not  know 
which  objects  belong  to  the  cluster  at  hand,  and  which  merely  happen  to  appear  nearby  from 
our  Earth-bound  observation  point.  Standard  practice  for  application  of  the  power  law-based 
approach  is  to  employ  complex  systems  for  estimating  which  objects  are  gravitationally  bound 
and  which  are  not.  Distribution  regression  with  the  skl  estimator,  however,  is  far  more  robust  to 
the  presence  of  these  interlopers  than  the  power  law  approach.  In  Ntampaka,  Trac,  Sutherland, 
Fromenteau,  et  al.  (in  press),  we  modified  the  catalog  to  use  a  very  simple  heuristic  for  choosing 
the  members  of  a  cluster  and  then  applied  the  same  prediction  techniques.  The  results  are  shown 
in Figure 5.2; the rmse of the power law is now an enormous 0.434, whereas that of distribution
regression is 0.177, matching the performance of the power law predictions based on perfect
knowledge about cluster membership.
5.2  Mixture  estimation


Statistical  inference  procedures  can  be  viewed  as  functions  from  distributions  to  the  reals;  we  can 
therefore  consider  learning  such  procedures.  Jitkrittum,  Gretton,  et  al.  (2015)  trained  MMD-based 
gp  regression  for  the  messages  computed  by  numerical  integration  in  an  expectation  propagation 
system,  and  saw  substantial  speedups  by  doing  so.  We,  inspired  by  J.  B.  Oliva,  Neiswanger,  et  al. 
(2014),  consider  a  problem  where  we  not  only  obtain  speedups  over  traditional  algorithms,  but 
actually  see  superior  results. 

Specifically,  we  consider  predicting  the  number  of  components  in  a  Gaussian  mixture.  We 
generate  mixtures  as  follows: 

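1. Draw the number of components $Y_i$ for the $i$th distribution as $Y_i \sim \mathrm{Unif}\{1, \dots, 10\}$.

2. For each component, select a mean $\mu_k^{(i)} \sim \mathrm{Unif}[-5, 5]^2$ and covariance
$\Sigma_k^{(i)} = a_k^{(i)} A_k^{(i)} A_k^{(i)\mathsf{T}} + B_k^{(i)}$, where $a_k^{(i)} \sim \mathrm{Unif}[1, 4]$,
$A_k^{(i)}(u, v) \sim \mathrm{Unif}[-1, 1]$, and $B_k^{(i)}$ is a diagonal $2 \times 2$ matrix with
$B_k^{(i)}(u, u) \sim \mathrm{Unif}[0, 1]$.

3. Draw a sample $X^{(i)}$ from the equally-weighted mixture of these components.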
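
The generation procedure above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code used for the experiments: the function name `sample_random_mixture`, the use of a Cholesky factor for sampling, and the tiny diagonal jitter are all assumptions made here for the example.

```python
import numpy as np

def sample_random_mixture(n, rng):
    """Draw one random 2-d Gaussian mixture and a sample of size n from it."""
    n_components = int(rng.integers(1, 11))             # number of components, Unif{1, ..., 10}
    means = rng.uniform(-5, 5, size=(n_components, 2))  # component means, Unif[-5, 5]^2
    covs = np.empty((n_components, 2, 2))
    for k in range(n_components):
        a = rng.uniform(1, 4)
        A = rng.uniform(-1, 1, size=(2, 2))
        B = np.diag(rng.uniform(0, 1, size=2))
        covs[k] = a * (A @ A.T) + B                      # symmetric positive semidefinite
    # Equal weights: pick a component uniformly at random for each point.
    labels = rng.integers(0, n_components, size=n)
    # Batched Cholesky factors; the jitter only guards against round-off (an assumption).
    chol = np.linalg.cholesky(covs + 1e-12 * np.eye(2))
    noise = rng.standard_normal((n, 2))
    X = means[labels] + np.einsum("nij,nj->ni", chol[labels], noise)
    return n_components, X
```

Calling `sample_random_mixture(200, np.random.default_rng(0))` returns one regression target (the number of components) together with one sample set of 200 two-dimensional points.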

An  example  distribution  and  sample  from  it  is  shown  in  Figure  5.3;  predicting  the  number  of 
components  is  difficult  even  for  humans. 

Density  with  9  Components  Sample  with  9  Components 


Figure  5.3:  Example  of  a  mixture  with  9  components  and  a  sample  from  it  of  size  n  =  200. 

We  compare  generalized  rbf  kernels  based  on  the  mmd,  L2,  and  hdd  embeddings  of  Chapter  4 
as  well  as  the  js  embedding  of  Vedaldi  and  Zisserman  (2012)  and  the  full  Gram  matrix  techniques 
of  Section  2.4  applied  to  the  skl  estimator  of  Q.  Wang  et  al.  (2009). 

Figure 5.4 presents results for predicting with ridge regression the number of mixture components
$Y_i$, given a varying number of sample sets $X^{(i)}$, with $|X^{(i)}| \in \{200, 800\}$; we use $D = 5\,000$.
The HDD-based kernels achieve substantially lower error than the L2 and mmd kernels in both
cases. They also outperform the histogram kernels, especially with $|X^{(i)}| = 200$, and the kl kernel.
Note that fitting mixtures with em and selecting a number of components using aic (Akaike)
or bic (Schwarz 1978) performed much worse than regression; only aic with $|X^{(i)}| = 800$
outperformed a constant predictor of 5.5. Linear versions of the L2 and mmd kernels, based on
(2.2) instead of the (2.3) results shown, were also no better than the constant predictor.
Figure 5.4: Error and computation time for estimating the number of mixture components. The
three points on each line correspond to training set sizes of 4k, 8k, and 16k; error is on the fixed
test set of size 2k. Note the logarithmic scale on the time axis. The kl kernel for sets of size 800
and 16k training sets was too slow to run. Predictions based on aic achieved rmses of 2.7 (for 200
samples) and 2.3 (for 800); bic errors were 3.8 and 2.7; a constant predictor of 5.5 had rmse of
2.8.
The hdd embeddings were more computationally expensive than the other embeddings, but
much less expensive than the kl kernel, which grows at least quadratically in the number of
distributions. Note that the histogram embeddings used an optimized C implementation by the paper’s
authors (Vedaldi and Fulkerson 2008), and the kl kernel used the optimized implementation of
skl-groups, whereas the hdd embeddings used a simple Matlab implementation.


5.3  Scene  recognition 

Representing  images  as  a  collection  of  local  patches  has  a  long  and  successful  history  in  computer 
vision. 

5.3.1  sift  features 

The  traditional  approach  selects  a  grid  of  patches,  computes  a  hand-designed  feature  vector  such 
as  sift  (Lowe  2004)  for  each  patch,  possibly  appends  information  about  the  location  of  the  patch, 
and  then  uses  the  bow  representation  for  this  set  of  features.  We  will  first  consider  the  use  of 
distributional  distance  kernels  for  this  feature  representation. 

We  present  here  results  on  the  8-class  ot  scene  recognition  dataset  (A.  Oliva  and  Torralba 
2001);  the  original  papers  show  results  on  additional  image  datasets.  This  dataset  contains  8 
outdoor  scene  categories,  illustrated  in  Figure  5.5.  There  are  2  688  total  images,  each  about 
256  x  256  pixels. 


Figure  5.5:  The  8  ot  categories:  coast,  forest,  highway,  inside  city,  mountain,  open  country, 
street,  tall  building. 


We  extracted  dense  color  sift  features  (Bosch  et  al.  2008)  at  six  different  bin  sizes  using 
VLfeat  (Vedaldi  and  Fulkerson  2008),  resulting  in  about  1815  feature  vectors  per  image,  each  of 
dimension  384.  We  used  pca  to  reduce  these  to  53  dimensions,  preserving  70%  of  the  variance, 
appended  relative  y  coordinates,  and  standardized  each  dimension.  (The  paper  contains  precise 
details.) 

The  results  of  10  repeats  of  10-fold  cross-validation  are  shown  in  Figure  5.6.  Each  approach 
uses a generalized rbf kernel. Here bow refers to vector quantization with k-means (k = 1 000),
plsa  to  the  approach  of  Bosch  et  al.  (2006),  g-kl  and  g-ppk  to  the  kl  and  Hellinger  divergences 
between  Gaussians  fit  to  the  data,  gmm-kl  to  the  kl  between  Gaussian  mixtures  fit  to  the  data 
with  expectation  maximization  (computing  via  Monte  Carlo),  pmk  to  the  pyramid  matching 
kernel  of  Grauman  and  Darrell  (  00’ ),  mmk  to  the  mmk  with  a  Gaussian  base  kernel,  nph  to  the 
nonparametric Hellinger estimate of Poczos, Xiong, Sutherland, et al. (2012), and npr- to the
estimates.  The  horizontal  line  shows  the  best  previously  reported  result  (Qin  and  Yung  2010), 
though  others  have  since  slightly  surpassed  our  results  here. 
5.3.2  Deep  features

For  the  last  several  years,  however,  modern  computer  vision  has  become  overwhelmingly  based 
on  deep  neural  networks.  Image  classification  networks  typically  broadly  follow  the  architecture 
of  Krizhevsky  et  al.  (  01),  i.e.  several  convolutional  and  pooling  layers  to  extract  complex 
features  of  input  images  followed  by  one  or  two  fully-connected  layers  to  classify  the  images. 

The activations are of shape $n \times h \times w$, where $n$ is the number of filters; each unit corresponds
to an overlapping patch of the original image. We can therefore treat the activations as a sample
of size $hw$ from an $n$-dimensional distribution. Wu et al. (2016) set accuracy records on several
scene  classification  datasets  with  a  particular  method  of  extracting  features  from  distributions. 
That  method,  however,  resorts  to  ad-hoc  statistics;  we  compare  to  our  more  principled  alternatives 
here. 
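
Viewing a convolutional activation map as a sample set amounts to a reshape. The following minimal sketch assumes the activations are stored as a NumPy array with the filter axis first; the function name `activations_to_sample` is hypothetical and introduced only for illustration.

```python
import numpy as np

def activations_to_sample(acts):
    """Reshape an (n, h, w) activation array into an (h*w, n) sample matrix.

    Each of the h*w spatial positions becomes one n-dimensional observation.
    """
    n, h, w = acts.shape
    return acts.reshape(n, h * w).T
```

Row `r * w + c` of the result holds the `n` filter responses at spatial position `(r, c)`, so any of the distributional kernels can then be applied to the rows as a sample.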

We consider here the Scene-15 dataset (Lazebnik et al. 2006), which contains 4485 natural
images in 15 categories based on location. (It is a superset of the ot dataset previously considered,
but is available only in grayscale.) We follow Wu et al. (2016) in extracting features from the
last convolutional layer of the imagenet-vgg-verydeep-16 model (Simonyan and Zisserman
2015). We replace that layer’s rectified linear activations with sigmoid squashing to [0, 1].¹ After
resizing the images as did Wu et al. (2016), $hw$ ranges from 400 to 1 000. There are 512 filter
dimensions; we concatenate features $A(\hat{p}_i)$ extracted from each independently.

We  select  100  images  from  each  class  for  training,  and  test  on  the  remainder;  Figure  5.7 
shows  the  results  of  10  random  splits.  We  do  not  add  any  spatial  information  to  the  model,  unlike 

1  We  used  piecewise-linear  weights  such  that  0  maps  to  0.5,  the  90th  percentile  of  the  positive  observations  maps 
to  0.9,  and  the  10th  percentile  of  the  negative  observations  to  0.1,  for  each  filter. 
Figure 5.7: Mean and standard deviation accuracies on the Scene-15 dataset. The left, black lines
show performance with linear features; the right, blue lines show generalized rbf embedding
features. D3 refers to the method of Wu et al. (2016). mmd bandwidths are relative to $\sigma$, the
median of pairwise distances; histogram methods use varying numbers of bins.


Wu  et  al.  (2016);  still,  we  match  the  best  prior  published  performance  of  91.59  ±  0.48,  using  a 
deep  network  trained  on  a  large  scene  classification  dataset  (Zhou  et  al.  2014).  Adding  spatial 
information  brought  the  D3  method  of  Wu  et  al.  (2016)  slightly  above  92%  accuracy;  their  best 
hybrid  method  obtained  92.9%.  Using  these  features,  however,  our  methods  match  or  beat  mmd 
and  substantially  outperform  D3,  L2,  and  the  histogram  embeddings. 


5.4  Small-sensor  detection  of  radiation  sources 

Preventing  the  proliferation  of  nuclear  weapons  and  stopping  nuclear  terrorist  attacks  is  one  of  the 
prime  responsibilities  of  security  agencies.  Tactical  nuclear  weapons  are  very  portable  and  pose 
great  risks  to  urban  environments.  Radioactive  isotopes  stolen  from  medical  uses  pose  another 
threat.  Although  certain  border  check  points  can  afford  to  require  potential  threats  to  go  through 
large  and  expensive  detectors,  mobile  radiation  detectors  are  vital  for  finding  radiation  sources 
which  either  successfully  passed  through  those  choke  points  or  managed  to  avoid  them.  In  certain 
situations,  sensors  carried  by  pedestrians  in  a  backpack  are  a  promising  tactic  for  seeking  out 
these  sources.  Much  of  the  time,  however,  the  targets  are  relatively  weak,  potentially  shielded, 
and masked by highly-variable patterns of background radiation, especially in the cluttered urban
environments  where  dirty  bombs  or  improperly  stored  radioactive  material  can  cause  the  most 
harm.  We  need,  therefore,  sophisticated  systems  which  can  detect  radioactive  sources  in  real  time 
while  also  maintaining  a  low  false  alarm  rate. 

With small sensors such as those considered here, the strong Compton effect makes observations
of a photon’s energy very noisy: high-energy photons are often measured as if they were
much lower-energy. Combined with the fewer total photons received by the smaller sensor, this
means that existing detection algorithms are typically outperformed in this setting by a simple
threshold on the total number of photons observed, regardless of energy.

Instead,  we  can  invert  a  probabilistic  sensor  response  model  to  obtain  a  distribution  of  possible 
energies  corresponding  to  each  photon  we  observe.  Given  such  a  model,  we  use  a  simple  Monte 
Carlo  technique:  replace  each  of  the  n  observed  photon  energies  with  1  000  samples  from  the 
distribution  of  possible  true  energies  corresponding  to  the  observed  photon.  We  then  model  the 
background distribution of radiation: for a given source of radiation, say Cs-137, we pick certain
ranges  of  energy  corresponding  to  that  source  (based  on  the  signal-to-noise  ratio  compared  to 
typical  background  distributions).  Then  we  model  the  expected  behavior  within  those  energies 
by  performing  distribution  regression  from  the  other  energy  levels  to  the  total  number  of  photons 
received  in  the  high-SNR  energy  levels.  That  is,  we  predict  a  total  count  y  of  photons  in  the 
high-SNR  energy  levels  based  on  the  distribution  of  photon  energies  observed  at  all  other  energy 
levels.  The  likelihood  of  source  presence  is  then  determined  by  the  departure  from  the  prediction: 
^r,  where  y  is  the  observed  number  of  photons  in  those  regions  and  y  the  prediction. 

We  simulated  this  process  by  taking  background  data  from  the  RadMAP  dataset  (Quiter 
et  al.  2015),  which  comprises  four  hours  of  observations  of  the  misti  mobile  detection  vehicle 
(Mitchell  et  al.  2009)  in  the  Berkeley,  CA  area.  We  also  obtained  characterizations  of  source 
spectra  from  collaborators  Simon  Labov  and  Karl  Nelson  at  Lawrence  Livermore.  We  generated 
background  data  from  the  observations  made  in  the  relatively  large-sensor  RadMAP  data  by 
simulating  small-sensor  measurements  of  it;  we  trained  on  background  data,  then  evaluated  on 
both  distinct  background  samples  and  samples  where  observations  corresponding  to  the  source 
were  injected.  Details  are  given  by  Jin  (2016). 

Figure 5.8 shows results for detecting Cs-137 sources at various classification thresholds. At
low  false  alarm  rates,  the  most  relevant  regime  since  true  sources  are  hopefully  quite  rare  in 
practice,  the  distribution  regression  method  substantially  outperforms  the  total  counts  algorithm; 
as  the  false  alarm  rate  is  allowed  to  increase,  total  counts  catches  up  but  never  outperforms 
distribution  regression. 

Figure 5.9 shows the improvement in probability of detection over total counts at the $10^{-3}$ false
alarm  rate  across  40  different  sources.  The  majority  of  sources  are  better-detected  by  distribution 
regression  than  by  total  counts,  some  of  them  substantially  so.  Jin  (2016)  shows  that  distribution 
regression  performs  better  in  cases  where  the  source’s  energy  output  is  more  concentrated  in 
certain  energy  levels,  as  might  be  expected.  He  also  shows  that  the  improvement  is  consistent 
across  different  experimental  setups,  corresponding  to  varying  the  strength  of  the  source  and  the 
size  of  the  sensor. 
Figure 5.8: Receiver operating characteristic for different detection methods, with log-scale for
the false alarm rate. lmr refers to list mode regression, the distributional regression technique;
cew is censored energy windowing; pca refers to background subtraction via principal components
analysis; random shows the hypothetical performance of a random classifier.


Figure  5.9:  Pairwise  improvement  in  probability  of  detection  at  false  alarm  rate  0.001  for  40 
different  sources,  sorted  by  the  improvement. 
Chapter 6

Choosing kernels for hypothesis tests

So far, we have assumed a particular kernel $k : \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ and used it to learn a classification
function $f : \mathcal{P} \to \{-1, 1\}$ or a regression function $f : \mathcal{P} \to \mathbb{R}$. Although several of the kernels
we have studied give good empirical performance on a variety of problems, with complex forms
of data we must first choose an appropriate feature extraction pipeline (such as the sift or deep
network features for the images in Section 5.3). Even in simple situations, we often have a family
of kernels we expect to work well but must pick an element of that family, e.g. bandwidth selection
in Gaussian rbf kernels. Kernel learning is an extensively studied but challenging
approach to this form of problem (Gonen and Alpaydin 2011; Z. Yang et al. 2015; J. B. Oliva,
Dubey, et al. 2016; Wilson et al. 2016). Though we could adapt those methods to distributional
settings, we will instead study here the related problem of two-sample testing.

Specifically, we observe samples $X = \{x_1, \dots, x_m\} \sim P^m$ and $Y = \{y_1, \dots, y_m\} \sim Q^m$.¹ We
wish to test the null hypothesis $H_0 : P = Q$ versus the alternative $H_1 : P \ne Q$. This problem
has many important applications including independence testing (Gretton, Bousquet, et al. 2005),
feature selection (Song et al. 2012), modeling of neuroimaging results (Tao and Feng 2016), data
integration and automated attribute matching (Gretton, Borgwardt, et al. 2012), and guiding the
training of generative models (Dziugaite et al. 2015; Y. Li et al. 2015). The problem is connected
to but in some senses easier than training a classifier to distinguish P from Q (Sriperumbudur,
Fukumizu, Gretton, Lanckriet, et al. 2009).

1 For simplicity, we assume here that the sample sizes are equal. The unequal case would not be fundamentally
more difficult.

One standard approach to performing these tests is to choose a kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and then
use a test statistic based on an estimate of $\mathrm{MMD}^2$ between the samples. In the standard hypothesis
testing framework, we choose a threshold $c_\alpha$ as the $(1 - \alpha)$th quantile of the distribution of the
test statistic under $H_0$, and reject $H_0$ if the statistic exceeds the threshold.

Many kernels, including the Gaussian rbf, are characteristic (Fukumizu et al. 2008), implying
that these tests are consistent: as $m \to \infty$, the power (that is, the probability that we reject $H_0$
when $H_1$ holds) converges to 1. Thus, given unlimited data and computational budget, we can
choose k as an arbitrary characteristic kernel. In practice, however, the power of the test depends
greatly on the choice of kernel. For example, if we select k from the family of Gaussian rbf
kernels, a bandwidth too different from the scale on which P and Q differ will be unable to
efficiently detect those differences. We thus need a criterion with which to select a kernel from
some family.
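
As an illustration of the thresholding step, the sketch below estimates the $(1 - \alpha)$th quantile of the null distribution by permuting the pooled sample, which is one common way to obtain it. The function names, the default number of permutations, and the choice of the biased estimate with a Gaussian rbf kernel as the statistic are assumptions made for this example only.

```python
import numpy as np

def mmd2_biased(X, Y, sigma=1.0):
    """Biased squared-MMD estimate with a Gaussian rbf kernel of bandwidth sigma."""
    Z = np.vstack([X, Y])
    sq = np.sum(Z**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2 * Z @ Z.T
    K = np.exp(-np.maximum(sq, 0) / (2 * sigma**2))
    m = len(X)
    return K[:m, :m].mean() + K[m:, m:].mean() - 2 * K[:m, m:].mean()

def permutation_test(X, Y, statistic, alpha=0.05, n_perms=200, rng=None):
    """Reject H0 if the statistic exceeds the (1 - alpha) quantile of permuted statistics."""
    rng = np.random.default_rng() if rng is None else rng
    m = len(X)
    Z = np.vstack([X, Y])
    observed = statistic(X, Y)
    null_stats = np.empty(n_perms)
    for b in range(n_perms):
        # Under H0 the pooled points are exchangeable, so any relabeling is equally likely.
        perm = rng.permutation(len(Z))
        null_stats[b] = statistic(Z[perm[:m]], Z[perm[m:]])
    threshold = np.quantile(null_stats, 1 - alpha)
    return bool(observed > threshold), observed, threshold
```

Because the kernel enters only through `statistic`, the same routine can be reused to compare the power of different kernel choices on held-out data.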

Moreover,  in  high  dimensions,  even  detecting  shifts  in  the  means  of  distributions  becomes  very 
difficult  with  general-purpose  kernels  (Ramdas,  Reddi,  et  al.  2015).  In  structured  domains  like 
images,  simple  kernels  like  the  Gaussian  rbf  also  approximate  natural  notions  of  similarity  quite 
poorly  except  when  complex  featurizations  are  first  applied;  this  was  the  problem  encountered  by, 
for  example,  Dziugaite  et  al.  (2015).  Thus,  we  would  like  to  be  able  to  choose  complex  kernels 
capable  of  examining  the  distributions  in  ways  particular  to  the  domain  at  hand. 

In  this  chapter,  we  develop  a  criterion  for  estimating  the  power  of  a  kernel  k  on  a  particular 
two-sample  test.  This  criterion  is  differentiable,  so  that  we  can  optimize  it  even  when  we  use 
complex  structures  such  as  deep  networks  within  the  kernel. 


6.1  Estimators  of  mmd 

Before  discussing  the  kernel  choice  criterion  and  its  antecedents,  we  need  to  briefly  discuss  some 
different  choices  of  estimators  for  mmd. 


Pairwise estimators  Perhaps the simplest estimator for mmd is as follows:

$$\widehat{\mathrm{MMD}}_b^2(X, Y) := \frac{1}{m^2} \sum_{i=1}^m \sum_{j=1}^m k(X_i, X_j) + \frac{1}{m^2} \sum_{i=1}^m \sum_{j=1}^m k(Y_i, Y_j) - \frac{2}{m^2} \sum_{i=1}^m \sum_{j=1}^m k(X_i, Y_j).$$

This is the exact mmd between the empirical distributions of the samples X and Y. Note,
however, that the first two sums include terms of the form $k(X_i, X_i)$; it turns out these bias
the estimator upwards. If we remove them, we get the minimum variance unbiased estimator
(Gretton, Borgwardt, et al. 2012):

$$\widehat{\mathrm{MMD}}_u^2(X, Y) := \frac{1}{m(m-1)} \sum_{i \ne j} k(X_i, X_j) + \frac{1}{m(m-1)} \sum_{i \ne j} k(Y_i, Y_j) - \frac{2}{m^2} \sum_{i=1}^m \sum_{j=1}^m k(X_i, Y_j).$$

The following estimator is very similar: it has slightly higher variance, by ignoring terms of the
form $k(X_i, Y_i)$, but allows us to apply the theory of U-statistics (Serfling 1980, Chapter 5) more
directly to the estimator. Let $W_i := (X_i, Y_i)$. Then:

$$h(w, w') := k(x, x') + k(y, y') - k(x, y') - k(x', y) \tag{6.1}$$

$$\widehat{\mathrm{MMD}}_U^2(X, Y) := \frac{1}{m(m-1)} \sum_{i \ne j} h(W_i, W_j).$$

These estimators are sometimes referred to as the “quadratic-time” estimators, because they
take $O(m^2)$ time to evaluate. They take $O(m)$ memory, because all samples must be stored in
memory.
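
The three pairwise estimators differ only in which diagonal terms of the kernel matrices they drop, which a short NumPy sketch makes concrete. The Gaussian rbf kernel, the function names, and the choice to materialize full $m \times m$ kernel matrices (rather than accumulating in blocks) are assumptions of this example.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian rbf kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0) / (2 * sigma**2))

def mmd2_pairwise(X, Y, sigma=1.0):
    """Return the (biased, unbiased, U-statistic) estimates of squared MMD."""
    m = len(X)
    assert len(Y) == m, "equal sample sizes are assumed"
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)

    def off_diag_sum(K):
        return K.sum() - np.trace(K)

    # Biased: keeps every term, including k(X_i, X_i) and k(Y_i, Y_i).
    biased = (Kxx.sum() + Kyy.sum() - 2 * Kxy.sum()) / m**2
    # Unbiased: drops the diagonals of Kxx and Kyy only.
    unbiased = (off_diag_sum(Kxx) + off_diag_sum(Kyy)) / (m * (m - 1)) - 2 * Kxy.sum() / m**2
    # U-statistic: additionally drops k(X_i, Y_i), i.e. the average of h(W_i, W_j) over i != j.
    u_stat = (off_diag_sum(Kxx) + off_diag_sum(Kyy) - 2 * off_diag_sum(Kxy)) / (m * (m - 1))
    return biased, unbiased, u_stat
```

For a kernel bounded by 1 with unit diagonal, such as the Gaussian rbf, the biased value is never smaller than the unbiased one, which is a convenient sanity check.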
Streaming estimators  When we have a very large, perhaps unbounded, number of samples
available and wish to perform the best test available with a given computational budget, or perhaps
when performing the test under strict memory restrictions, the following streaming estimator is
useful. Assume for the sake of convenience that m is even.

$$\widehat{\mathrm{MMD}}_l^2(X, Y) := \frac{2}{m} \sum_{i = 1, 3, 5, \dots, m-1} h(W_i, W_{i+1})$$

where we again use the h function from (6.1). Thus, we examine pairs of inputs at a time, and
once we have evaluated one pair we can forget it and move on to the next.

This estimator is useful in the streaming setting, and convenient to analyze because its terms
are independent. It takes $O(m)$ time to compute, and $O(1)$ memory. It is sometimes referred
to as the “linear-time” estimator; we avoid that term, however, because the embedding-based
estimators also take linear time.
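
A minimal sketch of the streaming estimator follows, consuming an iterator of $(x_i, y_i)$ pairs two at a time so that only a running sum is kept in memory. The Gaussian rbf kernel and the function names are assumptions of this example; the iterator is assumed to yield at least two pairs.

```python
import numpy as np

def k_gauss(a, b, sigma=1.0):
    """Gaussian rbf kernel between two vectors."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

def h(x, y, x2, y2, sigma=1.0):
    """h(w, w') from (6.1) for w = (x, y) and w' = (x2, y2)."""
    return (k_gauss(x, x2, sigma) + k_gauss(y, y2, sigma)
            - k_gauss(x, y2, sigma) - k_gauss(x2, y, sigma))

def mmd2_streaming(pairs, sigma=1.0):
    """Average h over consecutive, non-overlapping pairs (W_1, W_2), (W_3, W_4), ..."""
    total, count = 0.0, 0
    it = iter(pairs)
    for x, y in it:
        nxt = next(it, None)
        if nxt is None:          # odd number of inputs: ignore the last one
            break
        x2, y2 = nxt
        total += h(x, y, x2, y2, sigma)
        count += 1
    return total / count         # equals (2/m) * sum when m is even
```

For arrays `X` and `Y`, `mmd2_streaming(zip(X, Y))` gives the estimate; a generator reading from disk or a network source works equally well.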

When m is limited, however, the streaming estimator is far less efficient than the pairwise
estimators. Ramdas, Reddi, et al. (2015) show that even for testing against mean-shift alternatives,
the asymptotic power in the low-signal-to-noise, high-dimensional regime behaves like $\Phi(\sigma m)$ for
the pairwise estimator and $\Phi(\sigma \sqrt{m})$ for the streaming estimator, so that approximately $m^2$ samples
are needed for the streaming estimator to have equivalent power to the quadratic estimator.


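Embedding-based estimators  This is the approach of Section 4.1: assuming we have an
approximate embedding $k(x, y) \approx z(x)^\mathsf{T} z(y)$, and letting $\bar z(X) = \frac{1}{m} \sum_{i=1}^m z(X_i)$, we simply have

$$\widehat{\mathrm{MMD}}_b^2(X, Y) \approx \|\bar z(X) - \bar z(Y)\|^2.$$

We can also approximate the unbiased estimator, though it is not nearly as nice:

$$\widehat{\mathrm{MMK}}_u(X, X) \approx \frac{m^2}{m(m-1)} \left( \|\bar z(X)\|^2 - \frac{1}{m^2} \sum_{i=1}^m \|z(X_i)\|^2 \right)$$

$$\widehat{\mathrm{MMD}}_u^2(X, Y) \approx \widehat{\mathrm{MMK}}_u(X, X) + \widehat{\mathrm{MMK}}_u(Y, Y) - 2 \bar z(X)^\mathsf{T} \bar z(Y).$$

When $\|z(x)\| = 1$, as with the shift-invariant embedding z of (3.1), $\widehat{\mathrm{MMK}}_u(X, X)$ simplifies to
$\frac{m}{m-1} \|\bar z(X)\|^2 - \frac{1}{m-1}$. For the non-shift-invariant embedding z of (3.2), this is true only in expectation.

We could similarly approximate $\widehat{\mathrm{MMD}}_U^2$ if we wished, by subtracting off the terms
corresponding to $z(X_i)^\mathsf{T} z(Y_i)$.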
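
The embedding-based approximations can be sketched with random Fourier features for a Gaussian rbf kernel. The sin/cos feature map with unit norm, the frequency matrix `W` drawn from a standard normal (corresponding to bandwidth 1), and all function names are assumptions of this example rather than a description of the exact embedding used elsewhere in the text.

```python
import numpy as np

def rff_features(X, W):
    """z(x) = [sin(Wx), cos(Wx)] / sqrt(#frequencies), so that ||z(x)|| = 1."""
    proj = X @ W.T
    return np.hstack([np.sin(proj), np.cos(proj)]) / np.sqrt(W.shape[0])

def mmd2_embedding(X, Y, W):
    """Embedding-based approximations to the biased and unbiased squared-MMD estimates."""
    zx, zy = rff_features(X, W), rff_features(Y, W)
    mx, my = zx.mean(axis=0), zy.mean(axis=0)
    biased = np.sum((mx - my) ** 2)

    def mmk_u(z, zbar):
        # Mean of z_i^T z_j over i != j, written via the mean embedding.
        m = len(z)
        return m**2 / (m * (m - 1)) * (zbar @ zbar - np.sum(z**2) / m**2)

    unbiased = mmk_u(zx, mx) + mmk_u(zy, my) - 2 * mx @ my
    return biased, unbiased
```

Both quantities cost time linear in the number of samples, since only the mean embeddings and the squared feature norms are needed.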

Chwialkowski  et  al.  (2015)  studied  the  performance  of  two-sample  tests  using  these  estimators, 
and  found  that  although  their  performance  can  be  surprisingly  poor  in  certain  situations  (their 
Proposition  1),  a  related  class  of  tests  using  similar  embeddings  performs  well. 


6.2  Estimators  of  the  variance  of  mmd2 

Some of the kernel choice criteria we will develop shortly will require estimates of the variance
of $\widehat{\mathrm{MMD}}^2$.

Streaming estimator  The asymptotic distribution for $\widehat{\mathrm{MMD}}_l^2$ is simple: because it is an average
of independent random variables, the central limit theorem tells us that under either the null or
the alternative,

$$\frac{\widehat{\mathrm{MMD}}_l^2 - \mathrm{MMD}^2}{\sqrt{V_l^{(m)}}} \xrightarrow{D} \mathcal{N}(0, 1) \tag{6.2}$$

where

$$V_l^{(m)} := \frac{2}{m} \left( \mathbb{E}_{w,w'} h^2(w, w') - \left[ \mathbb{E}_{w,w'} h(w, w') \right]^2 \right). \tag{6.3}$$

This can be estimated in a streaming fashion as:

$$\hat V_l^{(m)} := \frac{4}{m^2} \sum_{i = 1, 5, 9, \dots, m-3} \left( h(W_i, W_{i+1}) - h(W_{i+2}, W_{i+3}) \right)^2. \tag{6.4}$$
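
The sketch below computes the streaming estimate together with a variance estimate built from squared differences of disjoint $h$ terms. It is normalized so that it estimates $\frac{2}{m} \mathrm{Var}[h(w, w')]$, the variance of an average of $m/2$ independent terms; that normalization, the Gaussian rbf kernel, the requirement that $m$ be a multiple of 4, and the function names are assumptions of this example.

```python
import numpy as np

def k_gauss(A, B, sigma=1.0):
    """Row-wise Gaussian rbf kernel values k(A[i], B[i])."""
    return np.exp(-np.sum((A - B) ** 2, axis=1) / (2 * sigma**2))

def h_consecutive(X, Y, sigma=1.0):
    """h(W_i, W_{i+1}) for i = 1, 3, 5, ... (1-based indexing), as a vector of m/2 values."""
    x, x2, y, y2 = X[0::2], X[1::2], Y[0::2], Y[1::2]
    return (k_gauss(x, x2, sigma) + k_gauss(y, y2, sigma)
            - k_gauss(x, y2, sigma) - k_gauss(x2, y, sigma))

def streaming_mmd_and_variance(X, Y, sigma=1.0):
    """Return the streaming squared-MMD estimate and an estimate of its variance."""
    m = len(X)
    assert len(Y) == m and m % 4 == 0, "this sketch assumes m is a multiple of 4"
    hv = h_consecutive(X, Y, sigma)      # m/2 independent terms
    mmd2 = hv.mean()                     # equals (2/m) * sum of the terms
    diffs = hv[0::2] - hv[1::2]          # m/4 differences of disjoint terms
    # E[diff^2] = 2 Var[h], so this sum estimates (2/m) Var[h].
    var_hat = 4.0 / m**2 * np.sum(diffs**2)
    return mmd2, var_hat
```

The ratio `mmd2 / np.sqrt(var_hat)` is then approximately standard normal under the null, which is what makes this estimator convenient for threshold selection.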


U-estimator  The asymptotic distribution for $\widehat{\mathrm{MMD}}_U^2$ is complex under $H_0$, and we will resort to
permutation tests to determine the test threshold. Under $H_1$, however, $\widehat{\mathrm{MMD}}_U^2$ is asymptotically
normal (Gretton, Borgwardt, et al. 2012):

$$\frac{\widehat{\mathrm{MMD}}_U^2 - \mathrm{MMD}^2}{\sqrt{V_U^{(m)}}} \xrightarrow{D} \mathcal{N}(0, 1), \tag{6.5}$$

with

$$V_U^{(m)} := \frac{4(m-2)}{m(m-1)} \zeta_1 + \frac{2}{m(m-1)} \zeta_2, \tag{6.6}$$

where $\zeta_1 := \mathrm{Var}_v[\mathbb{E}_{v'}[h(v, v')]]$ and $\zeta_2 := \mathrm{Var}_{v,v'}[h(v, v')]$. This is established for U-statistics
in general by Serfling (1980, Chapter 5); the analysis here was partially carried out for mmd in
particular in Appendix A of Bounliphone et al. (2015). Using $\varphi$ to denote the feature map of the
kernel k and $\mu_X = \mathbb{E}_x \varphi(x)$, $\mu_Y = \mathbb{E}_y \varphi(y)$, we have that:

$$\zeta_1 = \mathbb{E}_v\left[ \left( \mathbb{E}_{v'} h(v, v') \right)^2 \right] - \mathrm{MMD}^4
= \mathbb{E}_{x,y}\left[ \left( \langle \varphi(x), \mu_X \rangle + \langle \varphi(y), \mu_Y \rangle - \langle \varphi(x), \mu_Y \rangle - \langle \mu_X, \varphi(y) \rangle \right)^2 \right] - \mathrm{MMD}^4.$$

Expanding the square, we get an (unpleasant) expression in terms of expectations. $\zeta_2$ can be
calculated similarly.

We can estimate these terms based on a sample as follows. Let $K_{XX} := [k(X_i, X_j)]_{i,j}$,
$K_{YY} := [k(Y_i, Y_j)]_{i,j}$, $K_{XY} := [k(X_i, Y_j)]_{i,j}$, and let $\mathbf{1}$ refer to the all-ones vector of length m. Let $\tilde K_{XX}$,
$\tilde K_{YY}$, $\tilde K_{XY}$ be the kernel matrices with diagonal elements set to 0. Let $\|\cdot\|_F$ denote the Frobenius
norm. Then:


ft  = 
m(m  -  1  )(m  -  2) 


[KxxKxx  1  - 
1T KxxKxyI  + 
We  then  define


V, 


( m ) 
4(>»-2)  p  2  ; 

m(m— 1)=^  m(m— 1)^“ 


.  m(m-l) 


when  4(”7  + 

m\m—  1 J  ^ 

otherwise 


m(m- 1) 


&>0 


(6.7) 


6.3  mmd  kernel  choice  criteria 

We now suppose that we have some class of kernels $\mathcal{K}$ and would like to choose an element
$k \in \mathcal{K}$ with which to conduct our test.

In  general,  we  will  divide  the  observed  data  X  and  Y  into  two  partitions:  one  “training  sample” 
to  choose  the  kernel,  and  one  “testing  sample”  to  evaluate  the  test.  Doing  so  loses  some  statistical 
power,  but  the  test  statistic  distribution  becomes  quite  complicated  when  the  kernel  can  depend 
on  the  data. 

6.3.1  Median  heuristic 

Perhaps the most common criterion for choosing k applies only to the case where $\mathcal{K}$ is the family
of Gaussian rbf kernels with different bandwidths. This heuristic proposes to set $\sigma$ to the median
pairwise distance in the joint sample $X \cup Y$; despite its simplicity, it performs well on many
problems.
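
The median heuristic is short enough to state directly in code. This sketch takes the median over all distinct pairs in the pooled sample; the function name and the decision to exclude zero self-distances are assumptions of this example.

```python
import numpy as np

def median_heuristic_bandwidth(X, Y):
    """Median Euclidean distance between distinct points of the pooled sample."""
    Z = np.vstack([X, Y])
    sq = np.sum(Z**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2 * Z @ Z.T
    iu = np.triu_indices(len(Z), k=1)            # each unordered pair once, no diagonal
    return float(np.median(np.sqrt(np.maximum(sq[iu], 0))))
```

For large samples the full pairwise distance matrix can be expensive, so in practice the median is often computed on a random subsample.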

Reddi  et  al.  (20 1  - )  studied  its  theoretical  performance  in  high-dimensional  problems;  Ramdas, 
Reddi, et al. (2015) provide some theoretical justification in the particular case of testing for
mean-difference alternatives in the high-dimensional regime.
6.3.2  Marginal  likelihood  maximization

Flaxman, Sejdinovic, et al. (2016) propose a Bayesian model for learning kernel embeddings,
effectively  adding  a  Gaussian  Process  prior  to  the  estimator  of  the  mean  embedding.  Doing 
so  allows  for  a  fully-Bayesian  treatment  of  learning  the  corresponding  kernel,  and  they  give  an 
example  of  using  learned  kernels  on  testing  problems.  The  kernel  selection  criterion,  however, 
is  fully  unsupervised:  it  can  only  give  the  kernel  choice  that  best  describes  the  joint  data,  not 
one  that  best  distinguishes  between  the  two  datasets.  In  this  respect,  it  is  somewhat  similar  to  the 
median  heuristic,  though  it  has  some  ability  to  recognize  when  the  data  vary  on  multiple  scales. 

6.3.3  Maximizing  mmd 

Sriperumbudur, Fukumizu, Gretton, Lanckriet, et al. (2009) proposed choosing the kernel k
which maximizes the estimate $\widehat{\mathrm{MMD}}^2(X, Y)$. They showed that, for certain classes of kernels $\mathcal{K}$, the resulting
test is consistent; additionally, it performs well empirically on many problems.

It  does  not  directly  optimize  the  test  power,  however:  increasing  the  mmd  estimate  often  also 
increases  its  variance  and  thus  the  required  test  threshold  to  exceed. 

6.3.4  Cross-validation  of  loss 

Gretton, Sriperumbudur, et al. (2012) propose, as a method of comparison, choosing the kernel via cross-validation from the “classifier” perspective, following Sugiyama et al. (2011).

First, Sriperumbudur, Fukumizu, Gretton, Lanckriet, et al. (2009) establish the following interpretation of mmd as a classifier. Define the witness function as $f := \mu_P - \mu_Q$, which can be estimated by

$$\hat f(t) = \frac{1}{m} \sum_{i=1}^{m} k(X_i, t) - \frac{1}{m} \sum_{i=1}^{m} k(Y_i, t).$$

Note that, up to normalization, $f = \arg\sup_{\|f'\|_{\mathcal{H}} \le 1} \mathbb{E}_{X \sim P} f'(X) - \mathbb{E}_{Y \sim Q} f'(Y)$, using the definition of mmd as an integral probability metric. Then, we can view $\operatorname{sign}(f)$ as a Parzen window classifier trained with the points from X as positives and from Y as negatives. The mmd is then the negation of the linear loss function for that classifier.
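
The classifier view can be made concrete with a short sketch. It assumes a Gaussian rbf kernel and the unnormalized empirical witness function; all function names are hypothetical. Under those assumptions, the mean of the witness over X minus its mean over Y equals the biased (V-statistic) estimate of the squared mmd, which is the sense in which the mmd is the negated linear loss.

```python
import numpy as np


def rbf_gram(A, B, sigma):
    """Gaussian rbf Gram matrix between the rows of A and the rows of B."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def empirical_witness(X, Y, sigma):
    """Return a callable t -> mean_i k(X_i, t) - mean_j k(Y_j, t)."""
    def f_hat(T):
        return (rbf_gram(T, X, sigma).mean(axis=1)
                - rbf_gram(T, Y, sigma).mean(axis=1))
    return f_hat


def biased_mmd2(X, Y, sigma):
    """Biased estimate of MMD^2: all pairs included, diagonal terms too."""
    return (rbf_gram(X, X, sigma).mean()
            + rbf_gram(Y, Y, sigma).mean()
            - 2.0 * rbf_gram(X, Y, sigma).mean())
```

Here `np.sign(f_hat(T))` plays the role of the Parzen window classifier's prediction for test points `T`, with +1 meaning "looks like X".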

Following this view of mmd, one can choose a kernel k by choosing the best Parzen window classifier via cross-validation. That is, divide the data into K folds, and then for each fold, learn a witness function $\hat f$ on the other $K - 1$ folds and evaluate its linear loss on the remaining fold. Optionally, repeat this process for several splits. Choose the kernel with the lowest linear loss.

This  process  requires  evaluating  each  training  set  against  the  validation  set,  so  that  even  when 
the  streaming  estimator  is  used,  quadratically  many  comparisons  must  be  made. 

This  method  is  actually  quite  similar  to  choosing  the  kernel  via  maximizing  the  mmd,  but 
with  a  cross-validated  estimate  of  mmd  rather  than  evaluating  on  only  one  set.  Strathmann  (2012) 
found that, in certain problems, this approach outperformed maximizing the mmd.

6.3.5 Cross-validation of power

Strathmann (2012) proposes another method for applying cross-validation to kernel choice: directly estimate the power via cross-validation. Split the data into K folds, and perform a two-sample test on each set of $K - 1$ folds (ignoring the remaining fold). Repeat the data-splitting process. Then, choose the kernel which rejected the null hypothesis most often.

This approach can be performed in linear time when using $\widehat{\mathrm{MMD}}^2_S$, and was found by Strathmann (2012) to outperform cross-validation based on the loss, and sometimes the t-statistic approach discussed shortly, in the streaming setting.

6.3.6 Embedding-based Hotelling statistic

Jitkrittum, Szabo, et al. (2016) showed that one can perform kernel selection in the tests of Chwialkowski et al. (2015) simply by maximizing the test statistic.


6.3.7 Streaming t-statistic


Gretton, Sriperumbudur, et al. (2012) analyzed the problem of choosing a kernel for $\widehat{\mathrm{MMD}}^2_S$. Recall from (6.2) that

$$\frac{\widehat{\mathrm{MMD}}^2_S - \mathrm{MMD}^2}{\sqrt{V_S^{(m)}}} \xrightarrow{d} \mathcal{N}(0, 1),$$

using the variance $V_S^{(m)}$ from (6.3), which is the same under the null and the alternative. Thus, the asymptotic test threshold for the streaming estimator is simply

$$c_\alpha = \sqrt{V_S^{(m)}} \, \Phi^{-1}(1 - \alpha),$$

where $\Phi$ is the cdf of a standard normal random variable. Since $V_S^{(m)}$ is unknown in practice, we instead use the estimator of (6.4):

$$\hat c_\alpha := \sqrt{\hat V_S^{(m)}} \, \Phi^{-1}(1 - \alpha).$$

The asymptotic power of such a test is, using $\Pr_{H_1}$ to denote probability under the alternative $H_1$,

$$\Pr_{H_1}\left( \widehat{\mathrm{MMD}}^2_S > \hat c_\alpha \right) \to \Phi\left( \frac{\mathrm{MMD}^2}{\sqrt{V_S^{(m)}}} - \Phi^{-1}(1 - \alpha) \right).$$

The power is thus asymptotically maximized when $\mathrm{MMD}^2 / \sqrt{V_S^{(m)}}$ is maximal. In practice, we optimize $\widehat{\mathrm{MMD}}^2_S / \sqrt{\hat V_S^{(m)}}$.

We will call this quantity $t_S := \mathrm{MMD}^2 / \sqrt{V_S^{(m)}}$, and its estimator $\hat t_S := \widehat{\mathrm{MMD}}^2_S / \sqrt{\hat V_S^{(m)}}$, as it follows the form of a t-statistic for $\widehat{\mathrm{MMD}}^2_S$.

Gretton, Sriperumbudur, et al. (2012, Theorem 1) proved that, when considering nonnegative combinations of a fixed set of base kernels, the maximum of the ratio estimate approaches the maximum of the ratio at a rate $O_P(m^{-1/3})$, and that the kernel achieving the maximum ratio estimate converges in probability to the kernel achieving the maximum ratio.
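
The streaming criterion can be sketched as follows. Equations (6.2) to (6.4) are not reproduced in this section, so the sketch assumes the usual linear-time construction: the estimator is the mean of $h = k(x_1, x_2) + k(y_1, y_2) - k(x_1, y_2) - k(x_2, y_1)$ over disjoint pairs, and its variance is estimated by the sample variance of h divided by the number of pairs. The Gaussian rbf kernel and all function names are likewise assumptions for illustration.

```python
import numpy as np
from statistics import NormalDist


def rbf(a, b, sigma):
    """Gaussian rbf kernel evaluated row-wise on paired points."""
    return np.exp(-np.sum((a - b) ** 2, axis=1) / (2.0 * sigma ** 2))


def streaming_t_statistic(X, Y, sigma, alpha=0.1):
    """Return (MMD^2 estimate, t-statistic estimate, asymptotic threshold)."""
    n_pairs = min(len(X), len(Y)) // 2
    # Split each sample into disjoint consecutive pairs.
    x1, x2 = X[0:2 * n_pairs:2], X[1:2 * n_pairs:2]
    y1, y2 = Y[0:2 * n_pairs:2], Y[1:2 * n_pairs:2]
    h = (rbf(x1, x2, sigma) + rbf(y1, y2, sigma)
         - rbf(x1, y2, sigma) - rbf(x2, y1, sigma))
    mmd2 = h.mean()
    var = h.var(ddof=1) / n_pairs  # estimated variance of the mean of h
    t_hat = mmd2 / np.sqrt(var)
    threshold = np.sqrt(var) * NormalDist().inv_cdf(1.0 - alpha)
    return float(mmd2), float(t_hat), float(threshold)
```

Kernel selection would evaluate `t_hat` for each candidate kernel on the training split and keep the maximizer; the test itself rejects when the estimate on the testing split exceeds `threshold`.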


6.3.8 Pairwise t-statistic


We can in fact make a similar argument for $\widehat{\mathrm{MMD}}^2_U$.

Under $H_0$, $m \, \widehat{\mathrm{MMD}}^2_U$ converges in distribution to an infinite mixture of $\chi^2$ random variables, with weights depending on the (unknown) distributions P and Q as well as k (Gretton, Borgwardt, et al. 2012); $c_\alpha$ is thus difficult to evaluate in closed form. We can, however, estimate a data-dependent threshold $\hat c_\alpha$ according to a permutation test: randomly partition the data points $X \cup Y$ into $X'$ and $Y'$ many times, evaluate $m \, \widehat{\mathrm{MMD}}^2_U(X', Y')$ to approximate the null distribution, and then estimate the $(1 - \alpha)$th quantile $\hat c_\alpha$ from these samples.2
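
A minimal sketch of the permutation threshold is given below. It assumes a Gaussian rbf kernel and the form of the unbiased estimator that drops only the diagonal terms of the within-sample Gram blocks; the thesis's own definition appears earlier and may differ in detail. Function names are hypothetical, and this is not the Shogun implementation.

```python
import numpy as np


def rbf_gram(Z, sigma):
    """Gaussian rbf Gram matrix of the pooled sample Z."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def mmd2_u(K, m):
    """U-statistic estimate of MMD^2 from a pooled Gram matrix.

    The first m rows/columns of K are taken to belong to X, the rest to Y.
    """
    n = K.shape[0] - m
    Kxx, Kyy, Kxy = K[:m, :m], K[m:, m:], K[:m, m:]
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()


def permutation_threshold(K, m, alpha=0.1, n_perms=1000, seed=0):
    """(1 - alpha) quantile of m * MMD_U^2 over random re-partitions."""
    rng = np.random.default_rng(seed)
    total = K.shape[0]
    null_stats = np.empty(n_perms)
    for i in range(n_perms):
        perm = rng.permutation(total)
        # Permuting rows and columns together re-partitions the pooled data
        # without recomputing any kernel values.
        null_stats[i] = m * mmd2_u(K[np.ix_(perm, perm)], m)
    return float(np.quantile(null_stats, 1.0 - alpha))
```

The test rejects when `m * mmd2_u(K, m)` on the original ordering exceeds the returned threshold.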

Under the alternative $H_1$, however, recall from (6.5) that the distribution is asymptotically normal:

$$\frac{\widehat{\mathrm{MMD}}^2_U - \mathrm{MMD}^2}{\sqrt{V_U^{(m)}}} \xrightarrow{d} \mathcal{N}(0, 1),$$

using $V_U^{(m)}$ from (6.6). We can thus compute the test power as:

$$\Pr_{H_1}\left( m \, \widehat{\mathrm{MMD}}^2_U > \hat c_\alpha \right) = \Pr_{H_1}\left( \frac{\widehat{\mathrm{MMD}}^2_U - \mathrm{MMD}^2}{\sqrt{V_U^{(m)}}} > \frac{\hat c_\alpha / m - \mathrm{MMD}^2}{\sqrt{V_U^{(m)}}} \right) \to \Phi\left( \frac{\mathrm{MMD}^2}{\sqrt{V_U^{(m)}}} - \frac{c_\alpha}{m \sqrt{V_U^{(m)}}} \right).$$

Defining

$$\tau_U := \frac{\mathrm{MMD}^2}{\sqrt{V_U^{(m)}}} - \frac{c_\alpha}{m \sqrt{V_U^{(m)}}},$$

we see that the power is maximal when $\tau_U$ is maximal. In practice, we maximize its estimator $\hat\tau_U$.

But note that $c_\alpha$ and mmd are constant as the sample size m increases, and $V_U^{(m)}$ is $O(m^{-1})$. For large m, therefore, the first term dominates the second, and it suffices to maximize just the first term

$$t_U := \frac{\mathrm{MMD}^2}{\sqrt{V_U^{(m)}}} \quad \text{or its estimator} \quad \hat t_U := \frac{\widehat{\mathrm{MMD}}^2_U}{\sqrt{\hat V_U^{(m)}}},$$

which (appealingly) is of the same form as the t-statistic in the streaming setting. When the test is on the cusp of rejection, however, $c_\alpha \approx m \, \mathrm{MMD}^2$, and thus the two terms are of similar magnitude; additionally, since $t_U \ge 0$, using $t_U$ can only lead to asymptotic power predictions no smaller than $\Phi(0) = 1/2$. We will see in the experiments section that for simple problems, although $\hat t_U$ gives inaccurate estimates of the asymptotic power while those based on $\hat\tau_U$ are reasonable, the maximum of $\hat t_U$ often coincides with that of $\hat\tau_U$.

2 Gretton, Fukumizu, et al. (2009) proposed a way to estimate $c_\alpha$ without permutation tests by examining the eigenvalues of the data Gram matrix, but a recent cache-efficient implementation of permutation tests in the Shogun toolbox (Sonnenburg et al. 2010) is actually significantly quicker to compute than this estimate. We thus only consider permutation tests here.


Gradients  As  mentioned  previously,  complex  kernel  functions  are  far  more  powerful  in  some 
domains  than  simple  families  such  as  Gaussian  rbfs.  We  would  like  to  be  able  to  choose 
kernels  by,  for  example,  passing  inputs  through  a  deep  network  to  learn  a  representation,  and 
then  comparing  those  learned  representations  with  a  standard  kernel.  It  is  far  easier  to  optimize 
over such complex kernel classes, however, when gradient information is available. It is thus important to note that $\hat t_U$ is differentiable in k: $\widehat{\mathrm{MMD}}^2_U$ is an average of applications of k, and $\hat V_U^{(m)}$ (6.7) is based on terms of a similar form.3

In fact, we can also obtain stochastic gradients of $\hat c_\alpha$ with respect to k. Let $\Pi = \{\pi_1, \dots, \pi_N\}$ denote the set of permutations applied to the data, and $X'_\pi, Y'_\pi$ the result of applying one of those permutations, so that our approximate sample from the null distribution is $\{\eta_{\pi_i} := m \, \widehat{\mathrm{MMD}}^2_U(X'_{\pi_i}, Y'_{\pi_i})\}_{i=1}^{N}$. Let J be the nearest integer to $(1 - \alpha) N$, and j be the index of the permutation achieving the Jth smallest $\eta_\pi$ value. Then $\hat c_\alpha^{(\Pi)} = \eta_{\pi_j}$ is the test threshold corresponding to the set of permutations $\Pi$. The gradient of $\hat c_\alpha^{(\Pi)}$ by k is simply the gradient by k of $\eta_{\pi_j}$. But, assuming that $\widehat{\mathrm{MMD}}^2_U$ is Lipschitz in the parameterization of k, the Leibniz rule tells us that

$$\mathbb{E}_\Pi\left[ \nabla_k \hat c_\alpha^{(\Pi)} \right] = \nabla_k \mathbb{E}_\Pi\left[ \hat c_\alpha^{(\Pi)} \right] \approx \nabla_k c_\alpha.$$


6.4  Experiments 

We will now study the effectiveness of maximizing $\hat t_U$ and $\hat\tau_U$ versus that of maximizing the mmd on synthetic problems. Further experiments on more realistic problems are left to future work.

We do simple bandwidth selection for Gaussian rbf kernels. For each pair of distributions, we draw 100 samples (X, Y) and compute the criteria and run a permutation test for each of 30 logarithmically-spaced values for $\sigma$ from $10^{-1.7}$ to $10^{1.7}$. We use 1000 permutations in the tests, which are implemented in the feature/bigtest branch of Shogun (Sonnenburg et al. 2010). All tests use an allowed false positive rate of 10%. In the results, “best choice” refers to the bandwidth with the maximal empirical power.

3 The gradient is quite long to write out, but it is amenable to automatic differentiation, e.g. in Theano (The Theano Development Team et al. 2016).
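
For concreteness, the candidate grid and the selection step of the experimental setup can be sketched as below. The `criterion` argument is a hypothetical callable standing in for any of the criteria compared here (the mmd estimate or either t-statistic estimate); it is not part of Shogun.

```python
import numpy as np

# 30 logarithmically spaced bandwidths from 10^-1.7 to 10^1.7.
sigmas = np.logspace(-1.7, 1.7, num=30)


def select_bandwidth(criterion, X, Y, candidates=sigmas):
    """Return the candidate bandwidth with the largest criterion value.

    `criterion(X, Y, sigma)` must return a scalar score; larger is better.
    """
    scores = np.array([criterion(X, Y, s) for s in candidates])
    return float(candidates[int(np.argmax(scores))])
```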


6.4.1  Same  Gaussian 

In this situation, the null hypothesis holds: $P = Q = \mathcal{N}(0, I)$. Figure 6.1 verifies that the stated false positive level is adhered to by each proposed method.

Note that “best choice” here gives a rejection rate slightly larger than desired, because it is chosen to maximize the rejection rate on the same datasets as it is plotted on.

Figure 6.1: Same Gaussian problem: mean and standard deviation of test powers for increasing dimension.


6.4.2  Gaussian  variance  difference 

Here we test $P = \mathcal{N}(0, I)$ versus $Q = \mathcal{N}(0, I + e_1 e_1^\mathsf{T})$. (In Q, the variance of the first dimension is twice that in the other dimensions.) Figure 6.2 shows results; maximizing the mmd actually slightly outperforms maximizing $\hat t_U$ or $\hat\tau_U$.

Figure 6.3 breaks down the difference in the case d = 2, m = 100. We can see that maximizing the mmd usually picked bandwidths near the peak power, whereas $\hat t_U$ and $\hat\tau_U$ often picked bandwidths either somewhat larger than the peak or occasionally much smaller. Figure 6.4 shows the criteria used to select those bandwidths, including their asymptotic values based on the true mmd and the asymptotic variance of the $\widehat{\mathrm{MMD}}^2_U$ estimator for normal distributions. (For $\tau_U$, we used the mean value of the permutation-based $\hat c_\alpha$ across repeated draws from the dataset for the asymptotic value of $c_\alpha$.)

We can see here that the difference in performance is not just due to a poor variance estimate, but that the asymptotic values of $t_U$ and especially $\tau_U$ are less suited to bandwidth selection here than simply maximizing the mmd. Given the lack of theory about the power of tests based on maximizing the mmd, this difference is somewhat difficult to explain further.

Figure 6.2: Gaussian variance difference problem: mean and standard deviation of test powers for increasing dimension. Panels: (a) m = 100; (b) m = 500.


Figure  6.3:  Chosen  bandwidths  for  the  three  methods  for  Gaussian  variance  difference  for  d  =  2, 
m  =  100.  Vertical  gray  lines  represent  the  candidate  bandwidths,  in  log  scale;  bars  show  the 
number  of  times  each  bandwidth  was  chosen.  The  gray  dashed  line  shows  the  empirical  power 
of  each  bandwidth,  so  that  e.g.  the  central  bandwidth  1  achieved  power  about  0.7. 


Figure 6.4: The various criteria for the Gaussian variance difference problem. Panels: (a) $\widehat{\mathrm{MMD}}^2_U$; (b) $\hat t_U$; (c) $\hat\tau_U$. In each figure, the blue line shows the median of the estimator, the darker blue region 68% scatter, and the lighter blue region 95% scatter; thick red lines show the asymptotic value of the quantity in question. On a separate vertical scale (not labeled), gray dashed lines show the empirical power of each bandwidth, so that e.g. the central bandwidth 1 achieved power about 0.7.

6.4.3 Blobs


We now consider the blobs problem of Gretton, Sriperumbudur, et al. (2012): P is a 5 × 5 grid of two-dimensional standard normal components, with spacing 10 between the centers. Q is laid out identically, but each mixture component is

$$\mathcal{N}\left( \mu, \begin{bmatrix} 1 & \frac{\varepsilon - 1}{\varepsilon + 1} \\ \frac{\varepsilon - 1}{\varepsilon + 1} & 1 \end{bmatrix} \right),$$

so that the ratio of eigenvalues in its variance is $\varepsilon$. Note that at $\varepsilon = 1$, P = Q. An example grid is shown in Figure 6.5.

Figure 6.5: A sample from the Blobs problem, with m = 500, $\varepsilon = 6$.
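
The blobs distributions described above can be sampled with a few lines of NumPy. This is a sketch under stated assumptions: mixture components are equally weighted, the grid centers lie at multiples of the spacing starting from the origin, and the function names are hypothetical.

```python
import numpy as np


def blob_covariance(eps):
    """Unit-variance 2x2 covariance whose eigenvalue ratio equals eps."""
    rho = (eps - 1.0) / (eps + 1.0)
    return np.array([[1.0, rho], [rho, 1.0]])


def sample_blobs(m, eps=1.0, grid=5, spacing=10.0, seed=None):
    """Draw m points from a grid x grid mixture of 2-d Gaussians.

    eps = 1 gives P (standard normal components); eps > 1 gives Q.
    """
    rng = np.random.default_rng(seed)
    # Pick a grid cell uniformly at random for every point.
    centers = spacing * rng.integers(0, grid, size=(m, 2))
    noise = rng.multivariate_normal(np.zeros(2), blob_covariance(eps), size=m)
    return centers + noise
```

For example, `sample_blobs(500, eps=1.0)` and `sample_blobs(500, eps=6.0)` would give a pair of samples matching the setting of Figure 6.5.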


Figure 6.6 shows results; here, $\hat t_U$ and $\hat\tau_U$ each outperform mmd, especially when m = 500, and are nearly optimal.

We again take a closer look at the criteria, here where $\varepsilon = 6$, m = 500. Figure 6.7 shows the selected bandwidths; we can see that in this case, maximizing the mmd usually either picked bandwidths slightly too large or sometimes much too large, whereas $\hat t_U$ and $\hat\tau_U$ both consistently selected bandwidths around the peak power.

Figure 6.8 shows the criteria used to select those bandwidths. Here, although asymptotic values of the variance are available, 500 samples (in expectation, only 20 per blob) are not enough for the variance to converge well to its asymptotic value. Thus we use the empirical variance of the mmd estimator across our repeated dataset samples instead. We can see that in this case, although the true mmd peak is not too bad (it is only a little large), for large bandwidths the mmd estimator has a very high variance, and thus maximizing the mmd often picks a very large bandwidth value. $t_U$ and $\tau_U$, by contrast, both asymptotically peak in the correct location, and their estimates do not vary too widely other than in the cases where the mmd estimate blows up; in those cases its variance estimate increases even more, so that an already-bad location only seems worse than it really is.

Figure 6.6: Blobs problem: mean and standard deviation of test powers for increasing eigenvalue ratio. Panels: (a) m = 100; (b) m = 500.

Figure 6.7: Chosen bandwidths for the three methods (mmd, $\hat t_U$, $\hat\tau_U$) for the Blobs problem with $\varepsilon = 6$, m = 500. Figures as in Figure 6.3.


Figure 6.8: The various criteria for the blobs problem. Figures as in Figure 6.4, except that red lines use empirical variance across the samples rather than asymptotics.


Chapter 7

Active  search  for  patterns 


We  will  now  change  focus  slightly,  and  consider  another  problem  setting  in  which  collections  of 
data  play  a  key  role. 

Consider  a  function  containing  interesting  patterns  that  are  defined  only  over  a  region  of 
space.  For  example,  if  you  view  the  direction  of  wind  as  a  function  of  geographical  location, 
it  defines  fronts,  vortices,  and  other  weather  patterns,  but  those  patterns  are  defined  only  in  the 
aggregate.  If  we  can  only  measure  the  direction  and  strength  of  the  wind  at  point  locations,  we 
then  need  to  infer  the  presence  of  patterns  over  broader  spatial  regions. 

Many other real applications also share this feature. For example, an autonomous environmental monitoring vehicle with limited onboard sensors needs to strategically plan routes around an area to detect harmful plume patterns on a global scale (Valada et al. 2012). In astronomy,
projects  like  the  Sloan  Digital  Sky  Survey  (Eisenstein  et  al.  )  search  the  sky  for  large-scale 
objects  such  as  galaxy  clusters.  Biologists  investigating  rare  species  of  animals  must  find  the 
ranges  where  they  are  located  and  their  migration  patterns  (Brown  et  al.  2014).  We  aim  to  use 
active  learning  to  search  for  such  global  patterns  using  as  few  local  measurements  as  possible. 

This  bears  some  resemblance  to  the  artistic  technique  known  as  pointillism,  where  the  painter 
creates  small  and  distinct  dots  each  of  a  single  color,  but  when  viewed  as  a  whole  they  reveal 
a  scene.  Pointillist  paintings  typically  use  a  denser  covering  of  the  canvas,  but  in  our  setting, 
“observing  a  dot”  is  expensive.  Where  should  we  make  these  observations  in  order  to  uncover 
interesting  regions  as  quickly  as  possible? 

We  propose  a  probabilistic  solution  to  this  problem,  known  as  active  pointillistic  pattern 
search  (apps).  We  assume  we  are  given  a  predefined  list  of  candidate  regions  and  a  classifier 
that  estimates  the  probability  that  a  given  region  fits  the  desired  pattern.  Our  goal  is  then  to 
find  as  many  regions  that  are  highly  likely  to  match  the  pattern  as  we  can.  We  accomplish  this 
by  sequentially  selecting  point  locations  to  observe  so  as  to  approximately  maximize  expected 
reward. 


7.1  Related  work 

Our  concept  of  active  pattern  search  falls  under  the  broad  category  of  active  learning  (Settles 
2012), where we seek to sequentially build a training set to achieve some goal as fast as possible. Our focus solely on finding positive (“interesting”) regions, rather than attempting to learn to
discriminate  accurately  between  positives  and  negatives,  is  similar  to  the  problem  previously 
described  as  active  search  (Garnett  et  al.  2012).  In  previous  work  on  active  search,  however,  it 
has  been  assumed  that  the  labels  of  interest  can  be  revealed  directly.  In  active  pattern  search,  on 
the  other  hand,  the  labels  are  never  revealed  but  must  be  inferred  via  a  provided  classifier.  This 
indirection  increases  the  difficulty  of  the  search  task  considerably. 

In  Bayesian  optimization  (Osborne  et  al.  2009;  Brochu  et  al.  2010),  we  seek  to  find  the  global 
optimum  of  an  expensive  black-box  function.  Bayesian  optimization  provides  a  model-based 
approach  where  a  Gaussian  process  (gp)  prior  is  placed  on  the  objective  function,  from  which  a 
simpler  acquisition  function  is  derived  and  optimized  to  drive  the  selection  procedure.  Tesch  et  al. 
(2013)  extend  this  idea  to  optimizing  a  latent  function  from  binary  observations.  Our  proposed 
active  pattern  search  also  uses  a  Gaussian  process  prior  to  model  the  unknown  underlying  function 
and  derives  an  acquisition  function  from  it,  but  differs  in  that  we  seek  to  identify  entire  regions 
of  interest,  rather  than  finding  a  single  optimal  value. 

Another  intimately  related  problem  setup  is  that  of  multi-arm  bandits  (Auer  et  al.  2002),  with 
more  focus  on  analysis  of  the  cumulative  reward  over  all  function  evaluations.  Originally,  the 
goal  was  to  maximize  the  expectation  of  a  random  function  on  a  discrete  set;  a  variant  considers 
the  optimization  in  continuous  domains  (Kroemer  et  al.  0;  Niranjan  et  al.  >).  However, 
like  Bayesian  optimization,  multi-arm  bandit  problems  usually  do  not  consider  discriminating  a 
regional  pattern. 

Level set estimation (lse) (Low et al. 2012; Gotovos et al. 2013), rather than finding optima of a function, seeks to select observations so as to best discriminate the portions of a function above and below a given threshold. This goal, though related to ours, aims to directly map a portion of the function on the input space rather than seeking out instances of patterns. lse algorithms can be used to attempt to find some simple types of patterns, e.g. areas with high mean.

apps  can  be  viewed  as  a  generalization  of  active  area  search  (aas)  (Y.  Ma,  Garnett,  et  al. 
2014),  which  is  a  considerably  simpler  version  of  active  search  for  region-based  labels.  In  aas, 
the  label  of  a  region  is  only  determined  by  whether  its  mean  value  exceeds  some  threshold. 
apps  allows  for  arbitrary  classifiers  rather  than  simple  thresholds,  and  in  some  cases  its  expected 
reward  can  still  be  computed  analytically.  This  extends  the  usefulness  of  this  class  of  algorithms 
considerably. 


7.2  Problem  formulation 

There are three key components of the apps framework: a function f which maps input covariates to data observations, a predetermined set of regions wherein instances of function patterns are expected, and a classifier that evaluates the salience of the pattern of function values in each region. We define $f : \mathbb{R}^m \to \mathbb{R}$ to be the function of interest,1 which can be observed at any location $x \in \mathbb{R}^m$ to reveal a noisy observation z. We assume the observation model $z = f(x) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. We suppose that a set of regions where matching patterns might be found is predefined, and will denote these $\{g_1, \dots, g_k\}$; $g_i \subset \mathbb{R}^m$. Finally, for each region g, we assume a classifier $h_g$ which evaluates f on g and returns the probability that it matches the target pattern, which we call salience: $h_g(f) = h(f; g) \in [0, 1]$, where the mathematical interpretation of $h_g$ is similar to a functional of f. Classifier forms are typically the same for all regions, with different parameters.

1 For clarity, in this and the next sections we will focus on scalar-valued functions f. The extension to vector-valued functions is straightforward; we consider such a case in the experiments.

Unfortunately, in general, we will have little knowledge about f other than the limited observations made at our selected set of points. Classifiers which take functional inputs (such as our assumed $h_g$) generally do not account for uncertainty in their inputs, which should be inversely related to the number of observed data points. We thus consider the probability that $h_g(f)$ is high enough, marginalized across the range of functions f that might match our observations. As is common in nonparametric Bayesian modeling, we model f with a Gaussian process (gp) prior; we assume that hyperparameters, including prior mean and covariance functions, are set by domain experts. Given a dataset $\mathcal{D} = (X, z)$, we define

$$f \sim \mathcal{GP}(\mu, \kappa); \qquad f \mid \mathcal{D} \sim \mathcal{GP}(\mu_{f \mid \mathcal{D}}, \kappa_{f \mid \mathcal{D}}),$$

to be a given gp prior and its posterior conditioned on $\mathcal{D}$, respectively. Thus, since f is a random variable, we can obtain the marginal probability that g is salient,

$$T_g(\mathcal{D}) = \mathbb{E}_f[h_g(f) \mid \mathcal{D}]. \tag{7.1}$$

We then define a matching region as one whose marginal probability passes a given threshold $\theta$. Unit reward is assigned to each matching region g:

$$r_g(\mathcal{D}) := \mathbb{1}\{T_g(\mathcal{D}) > \theta\}.$$

We make two assumptions regarding the interactive procedure. The first is that once a region is identified as potentially matching (i.e., its marginal probability exceeds $\theta$), it will be immediately flagged for further review and no longer considered during the run. The second is that the data
resulting  from  this  investigation  will  not  be  made  immediately  available  during  the  course  of  the 
algorithm;  rather  the  classifiers  hg  will  be  trained  offline.  We  consider  both  of  these  assumptions 
to  be  reasonable  when  the  cost  of  investigation  is  relatively  high  and  the  investigation  collects 
different  types  of  data.  For  example,  if  the  algorithm  is  being  used  to  run  autonomous  sensors 
and  scientists  collect  separate  data  to  follow  up  on  a  matching  region,  these  assumptions  allow  the 
autonomous  sensors  to  continue  in  parallel  with  the  human  intervention,  and  avoid  the  substantial 
complexity  of  incorporating  a  completely  different  modality  of  data  into  the  modeling  process. 

Garnett  et  al.  (2012)  attempt  to  maximize  their  reward  at  the  end  of  a  fixed  number  of  queries. 
Directly  optimizing  that  goal  involves  an  exponential  lookahead  process.  However,  this  can 
be  approximated  by  a  greedy  search  like  the  one  we  perform.  Similarly,  one  could  attempt  to 
maximize  the  area  under  the  recall  curve  through  the  search  process.  This  also  requires  an 
intractable  amount  of  computation  which  is  often  replaced  with  a  greedy  search. 

We now write down the greedy criterion our algorithm seeks to optimize. Define $\mathcal{D}_t$ to be the already collected (noisy) observations of f before time step t and $\mathcal{G}_t = \{g : T_g(\mathcal{D}_\tau) \le \theta, \forall \tau \le t\}$ to be the set of remaining search subjects, those regions which are not yet confidently salient; we aim to greedily maximize the sum of rewards over all the regions in $\mathcal{G}_t$, in expectation,

$$\max_{x_*} \; \mathbb{E}\left[ \sum_{g \in \mathcal{G}_t} r_g(\mathcal{D}_*) \;\middle|\; x_*, \mathcal{D}_t \right], \tag{7.2}$$

where $\mathcal{D}_*$ is the (random) dataset augmented with $x_*$.

This  criterion  satisfies  a  desirable  property:  when  the  regions  are  uncoupled  and  the  classifier 
hg  is  probit-linear,  the  point  that  maximizes  (7.2)  in  each  region  also  minimizes  the  variance  of 
that  region’s  label  (Section  7.3.2). 


7.3  Method 

To maximize the greedy expected reward of finding matching patterns (7.2), a closer examination of the gp model yields a straightforward sampling method. This method, described in the following, turns out to be quite useful in apps problems with rather complex classifiers. Section 7.3.1 introduces an analytical solution in an important special case.

At each step, given $\mathcal{D}_t = (X, z)$ as the set of any already collected (noisy) observations of f and $x_*$ as any potential input location, we can write the distribution of possible observations $z_* = f(x_*) + \varepsilon$ as

$$z_* \mid x_*, \mathcal{D}_t \sim \mathcal{N}\left( \mu_{f \mid \mathcal{D}_t}(x_*), \; \kappa_{f \mid \mathcal{D}_t}(x_*, x_*) + \sigma^2 \right). \tag{7.3}$$

Conditioned on an observation value $z_*$, we can update our gp model to include the new observation $(x_*, z_*)$, which further affects the marginal distribution of region classifier outputs and thus the probability this region is matching. With $\mathcal{D}_* = \mathcal{D}_t \cup \{(x_*, z_*)\}$ as the updated dataset, we use $r_g(\mathcal{D}_*)$ to be the updated reward of region g. The utility of this proposed location $x_*$ for region g is thus measured by the expected reward function, marginalizing out the unknown observation value $z_*$:

$$u_g(x_*, \mathcal{D}_t) := \mathbb{E}_{z_*}\left[ r_g(\mathcal{D}_*) \mid x_*, \mathcal{D}_t \right] \tag{7.4}$$

$$= \Pr\{ T_g(\mathcal{D}_*) > \theta \mid x_*, \mathcal{D}_t \}. \tag{7.5}$$

Finally, in active pointillistic pattern search, we select the next observation location $x_*$ by considering its expected reward over the remaining regions:

$$x_* = \operatorname*{argmax}_x \, u(x, \mathcal{D}_t) = \operatorname*{argmax}_x \sum_{g \in \mathcal{G}_t} u_g(x, \mathcal{D}_t). \tag{7.6}$$

For the most general definition of the region classifier $h_g$, the basic algorithm is to compute (7.4) and thus (7.6) via sampling at two stages:

1. Sample the outer variable $z_*$ in (7.4) according to (7.3).

2. For every draw of $z_*$, sample enough of $(f \mid \mathcal{D}_*)$ to compute the marginal probability $T_g(\mathcal{D}_*)$ in (7.1), and hence the reward $r_g(\mathcal{D}_*)$, in order to obtain one draw for the expectation in (7.4).
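
The two sampling stages above can be sketched as follows for a single region and a single candidate location. The sketch assumes that the classifier depends on f only through its values at a finite set of points in the region, and that the current posterior quantities at those points (`mu`, `K`), their posterior covariance with $f(x_*)$ (`k_star`), and the posterior variance of $f(x_*)$ (`kappa_ss`) have already been computed; all names are hypothetical.

```python
import numpy as np


def sampled_expected_reward(mu, K, k_star, kappa_ss, noise_var, h, theta,
                            n_outer=200, n_inner=200, seed=0):
    """Monte Carlo estimate of u_g(x_*, D_t) in (7.4) for one region.

    h maps a vector of function values at the region's points to [0, 1].
    """
    rng = np.random.default_rng(seed)
    V = kappa_ss + noise_var                  # Var[z_* | D_t], as in (7.3)
    K_new = K - np.outer(k_star, k_star) / V  # covariance after observing z_*
    chol = np.linalg.cholesky(K_new + 1e-10 * np.eye(len(mu)))
    hits = 0
    for _ in range(n_outer):
        # Stage 1: draw the centered observation z_* - mu(x_*) ~ N(0, V).
        delta = rng.normal(0.0, np.sqrt(V))
        mu_new = mu + k_star * delta / V
        # Stage 2: draw f | D_* at the region's points and average the
        # classifier output to estimate T_g(D_*) from (7.1).
        f_draws = mu_new + rng.standard_normal((n_inner, len(mu))) @ chol.T
        T_g = np.mean([h(f) for f in f_draws])
        hits += int(T_g > theta)              # one draw of the reward r_g(D_*)
    return hits / n_outer
```

Evaluating this for each remaining region and each candidate $x_*$, then summing over regions, gives a sampled version of the objective in (7.6).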

To speed up the process, we can evaluate (7.6) for a subset of possible $x_*$ values, as long as a good action is likely to be contained in the set.


7.3.1 Analytic expected utility for functional probit models

For a broad family of classifiers, those formed by a probit link function of any affine functional of f (7.7), we can compute both (7.1) and (7.5) analytically. Thus, we can efficiently perform exact searches for potentially complex patterns defined by probit-linear classifiers.

Suppose we have observed data $\mathcal{D}$, yielding the posterior $p(f \mid \mathcal{D}) = \mathcal{GP}(f; \mu_{f \mid \mathcal{D}}, \kappa_{f \mid \mathcal{D}})$. Let $L_g$ be a linear functional, $L_g : f \mapsto L_g f \in \mathbb{R}$, associated with region g. The family of classifiers is:

$$h_g(f) = \Phi(L_g f + b_g), \tag{7.7}$$

where $\Phi$ is the cumulative distribution function of the standard normal and $b_g \in \mathbb{R}$ is an offset. Two examples of such functionals are:

• $L_g : f \mapsto \frac{c}{|g|} \int_g f(x) \, dx$, where |g| is the volume of region $g \subset \mathbb{R}^m$. Here $L_g f$ is the mean value of f on g, scaled by an arbitrary $c \in \mathbb{R}$. When $|c| \to \infty$ the model becomes quite similar to that of Y. Ma, Garnett, et al. (2014).

• $L_g : f \mapsto w^\mathsf{T} f(\Xi)$, where $\Xi$ is a finite set of fixed points $\{\xi_i\}_{i=1}^{|\Xi|}$, and $w \in \mathbb{R}^{|\Xi|}$ is an arbitrary vector. This mapping applies a linear classifier to a fixed, discrete set of values from f.

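As Gaussian processes are closed under linear transformations, $Lf + b$ has a normal distribution:

$$Lf + b \sim \mathcal{N}\left( L\mu_{f \mid \mathcal{D}} + b, \; L^2 \kappa_{f \mid \mathcal{D}} \right),$$

where $L^2$ is the bilinear form defined by $L^2 \kappa := L[L\kappa(x, \cdot)] = L[L\kappa(\cdot, x')]$. For the specific cases above, we can explicitly calculate the mean and variance of $Lf + b$: for $Lf = w^\mathsf{T} f(\Xi)$,

$$\mathbb{E}_f[Lf \mid \mathcal{D}] = w^\mathsf{T} \mu_{f \mid \mathcal{D}}(\Xi), \qquad \operatorname{Var}_f[Lf \mid \mathcal{D}] = w^\mathsf{T} \kappa_{f \mid \mathcal{D}}(\Xi, \Xi) \, w,$$

and for $Lf = \frac{c}{|g|} \int_g f(x) \, dx$,

$$\mathbb{E}_f[Lf \mid \mathcal{D}] = \frac{c}{|g|} \int_g \mu_{f \mid \mathcal{D}}(x) \, dx, \qquad \operatorname{Var}_f[Lf \mid \mathcal{D}] = \frac{c^2}{|g|^2} \iint_{g \times g} \kappa_{f \mid \mathcal{D}}(x, x') \, dx \, dx'.$$

For certain classes of covariance functions $\kappa$, the above integrals are tractable; they occur when estimating integrals via Bayesian quadrature, also known as Bayesian Monte Carlo (Rasmussen and Ghahramani 2003).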

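Then we have the marginal probability that $g$ is salient (7.1) in closed form:

$$
\begin{aligned}
T_g(\mathcal{D}) &= \mathbb{E}_f[h_g(f) \mid \mathcal{D}] \\
&= \mathbb{E}_f[\Phi(Lf + b) \mid \mathcal{D}] \\
&= \Phi\left( \frac{L\mu_{f|\mathcal{D}} + b}{\sqrt{1 + L^2 \kappa_{f|\mathcal{D}}}} \right),
\end{aligned}
$$

using the fact that if $A \sim \mathcal{N}(\mu, \sigma^2)$, then $\mathbb{E}[\Phi(A)] = \Phi\left(\mu / \sqrt{1 + \sigma^2}\right)$.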
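The identity behind this closed form is easy to check numerically. The following is a minimal sketch using only the Python standard library; the function names and the example values of $L\mu_{f|\mathcal{D}}$, $L^2\kappa_{f|\mathcal{D}}$ and $b$ are illustrative assumptions, not quantities from the experiments below.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal; .cdf plays the role of Phi


def salient_probability(l_mean, l_var, b):
    """Closed form Phi((L mu + b) / sqrt(1 + L^2 kappa)).

    l_mean stands for L mu_{f|D} and l_var for L^2 kappa_{f|D}.
    """
    return STD_NORMAL.cdf((l_mean + b) / math.sqrt(1.0 + l_var))


def salient_probability_mc(l_mean, l_var, b, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[Phi(Lf + b)] with Lf ~ N(l_mean, l_var)."""
    rng = random.Random(seed)
    sd = math.sqrt(l_var)
    total = 0.0
    for _ in range(n_samples):
        total += STD_NORMAL.cdf(rng.gauss(l_mean, sd) + b)
    return total / n_samples


if __name__ == "__main__":
    exact = salient_probability(l_mean=0.8, l_var=2.5, b=-0.3)
    approx = salient_probability_mc(l_mean=0.8, l_var=2.5, b=-0.3)
    print(f"closed form: {exact:.4f}   Monte Carlo: {approx:.4f}")
```

The two numbers should agree up to Monte Carlo error, which is the practical content of the identity: no sampling of $f$ is needed to evaluate the marginal probability for this classifier family.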
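Now we turn to the expected utility of a new observation (7.5). Consider a potential observation
location $x_*$, and again define $\mathcal{D}_* := \mathcal{D} \cup \{(x_*, z_*)\}$. Then $u_g(x_*, \mathcal{D})$ is:

$$
\begin{aligned}
u_g(x_*, \mathcal{D})
&= \Pr\left( \Phi\left( \frac{L\mu_{f|\mathcal{D}_*} + b}{\sqrt{1 + L^2 \kappa_{f|\mathcal{D}_*}}} \right) > \theta \;\middle|\; x_*, \mathcal{D} \right) \\
&= \Pr\left( \frac{L\mu_{f|\mathcal{D}_*} + b}{\sqrt{1 + L^2 \kappa_{f|\mathcal{D}_*}}} > \Phi^{-1}(\theta) \;\middle|\; x_*, \mathcal{D} \right),
\end{aligned}
\tag{7.8}
$$

where $\Phi^{-1}$ is the inverse of the normal cdf.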

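Letting the variance of the new point $x_*$ given the dataset $\mathcal{D}$ be denoted by

$$V_{*|\mathcal{D}} := \operatorname{Var}[z_* \mid \mathcal{D}] = \kappa_{f|\mathcal{D}}(x_*, x_*) + \sigma^2,$$

we have

$$
\begin{aligned}
L^2 \kappa_{f|\mathcal{D}_*}
&= L^2 \left[ \kappa_{f|\mathcal{D}}(x, x') - \kappa_{f|\mathcal{D}}(x, x_*) \, V_{*|\mathcal{D}}^{-1} \, \kappa_{f|\mathcal{D}}(x_*, x') \right] \\
&= L^2 \kappa_{f|\mathcal{D}} - V_{*|\mathcal{D}}^{-1} \left[ L\kappa_{f|\mathcal{D}}(\cdot, x_*) \right]^2,
\end{aligned}
\tag{7.9}
$$

which does not depend on $z_*$.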

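Next, consider the distribution of $L\mu_{f|\mathcal{D}_*}$. If we knew the observation value $z_*$, we could
compute the updated posterior mean as

$$\mu_{f|\mathcal{D}_*}(x) = \mu_{f|\mathcal{D}}(x) + \kappa_{f|\mathcal{D}}(x, x_*) \, V_{*|\mathcal{D}}^{-1} \left( z_* - \mu_{f|\mathcal{D}}(x_*) \right).$$

But, thanks to the linearity of $L$ and the known Gaussian distribution on $z_*$, the updated posterior
mean of $Lf$ is also normally distributed with

$$L\mu_{f|\mathcal{D}_*} \mid x_*, \mathcal{D} \sim \mathcal{N}\left( L\mu_{f|\mathcal{D}}, \; V_{*|\mathcal{D}}^{-1} \left[ L\kappa_{f|\mathcal{D}}(\cdot, x_*) \right]^2 \right), \tag{7.10}$$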


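and so, using (7.10) in (7.8), we can finally compute the desired expected reward $u_g(x_*, \mathcal{D})$ in
closed form:

$$
u_g(x_*, \mathcal{D}) = \Phi\left( \frac{L_g \mu_{f|\mathcal{D}} + b - \sqrt{1 + L_g^2 \kappa_{f|\mathcal{D}_*}} \; \Phi^{-1}(\theta)}{\sqrt{V_{*|\mathcal{D}}^{-1} \left[ L_g \kappa_{f|\mathcal{D}}(\cdot, x_*) \right]^2}} \right).
\tag{7.11}
$$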
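To make the closed form concrete, the sketch below evaluates (7.11) from a handful of scalar posterior summaries and compares it with a direct simulation of (7.8), in which $z_*$ is drawn from its predictive distribution and the posterior is updated each time. It uses only the Python standard library; the argument names and the example numbers are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist

_STD = NormalDist()


def expected_utility(l_mean, l_var, l_cov_star, k_star, noise_var, b, theta):
    """Closed-form expected utility of observing at x*, following (7.11).

    l_mean     : L mu_{f|D}
    l_var      : L^2 kappa_{f|D}
    l_cov_star : L kappa_{f|D}(., x*)
    k_star     : kappa_{f|D}(x*, x*)
    noise_var  : observation noise sigma^2 (assumed > 0)
    """
    v_star = k_star + noise_var              # Var[z* | D]
    shrink = l_cov_star ** 2 / v_star        # V^{-1} [L kappa(., x*)]^2
    l_var_new = l_var - shrink               # updated variance, (7.9)
    threshold = math.sqrt(1.0 + l_var_new) * _STD.inv_cdf(theta)
    if shrink == 0.0:                        # observation carries no information
        return 1.0 if l_mean + b > threshold else 0.0
    return _STD.cdf((l_mean + b - threshold) / math.sqrt(shrink))


def expected_utility_mc(l_mean, l_var, l_cov_star, k_star, noise_var, b, theta,
                        n_samples=50_000, seed=0):
    """Simulate (7.8): draw z*, update the posterior, test T_g > theta."""
    rng = random.Random(seed)
    v_star = k_star + noise_var
    l_var_new = l_var - l_cov_star ** 2 / v_star
    hits = 0
    for _ in range(n_samples):
        surprise = rng.gauss(0.0, math.sqrt(v_star))        # z* - mu(x*)
        l_mean_new = l_mean + l_cov_star / v_star * surprise
        t_new = _STD.cdf((l_mean_new + b) / math.sqrt(1.0 + l_var_new))
        hits += t_new > theta
    return hits / n_samples


if __name__ == "__main__":
    args = dict(l_mean=0.2, l_var=1.5, l_cov_star=0.9, k_star=1.0,
                noise_var=0.1, b=0.0, theta=0.7)
    print(expected_utility(**args), expected_utility_mc(**args))
```

The inputs must be mutually consistent (for instance, `l_cov_star ** 2` cannot exceed `l_var * k_star`); the sketch does not check this.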


7.3.2  Analysis  for  independent  regions 

The analytical solution to (7.5) by (7.11) enables us to further study the theory behind the
exploration/exploitation tradeoff of APPS in one nontrivial case: when all regions are approximately
independent. This assumption allows us to ignore the effect a data point has on regions other
than its own. We will answer two questions in this case: which region will APPS explore next,
and what location will be queried for that region.




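Define

$$
\rho_g(x_*) := \sqrt{\frac{V_{*|\mathcal{D}}^{-1} \left[ L_g \kappa_{f|\mathcal{D}}(\cdot, x_*) \right]^2}{1 + L_g^2 \kappa_{f|\mathcal{D}}}}
= \sqrt{\frac{\operatorname{Var}\left[ L_g \mu_{f|\mathcal{D}_*} \mid x_*, \mathcal{D} \right]}{1 + \operatorname{Var}\left[ L_g f + b \mid \mathcal{D} \right]}},
\tag{7.12}
$$

which in some sense denotes how informative the observation $z_*$ is expected to be to the label of
its region $g$. With this notation, (7.9) becomes

$$1 + L_g^2 \kappa_{f|\mathcal{D}_*} = \left(1 - \rho_g(x_*)^2\right) \left(1 + L_g^2 \kappa_{f|\mathcal{D}}\right).$$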


Assume for now that $\theta > 0.5$. (Our conclusions remain true for any $\theta$, but for simplicity we
consider only the common case here.) Then we can define how close $g$ is to receiving a reward
by

$$R_g := \frac{L_g \mu_{f|\mathcal{D}} + b}{\Phi^{-1}(\theta) \sqrt{1 + L_g^2 \kappa_{f|\mathcal{D}}}}. \tag{7.13}$$


Thus the utility (7.11) becomes, using (7.12) and (7.13):

$$u_g(x_*, \mathcal{D}) = \Phi\left( \Phi^{-1}(\theta) \, \frac{R_g - \sqrt{1 - \rho_g(x_*)^2}}{\rho_g(x_*)} \right).$$

We can now see by taking partial derivatives that for any region not currently carrying a reward:

1. For any region $g$, $u_g(x, \mathcal{D})$ is maximized by choosing an $x$ that yields $\rho_g^* := \max_x \rho_g(x)$.

2. If two regions $g$ and $g'$ can be equally explored ($\rho_g^* = \rho_{g'}^*$), then the region with higher
probability of matching (higher $R_g$) will be selected.

3. If two regions are equally likely to match the desired pattern ($R_g = R_{g'}$), the more explorable
region (that with a larger $\rho_g^*$) will be selected.

4. In general, APPS will trade off the two factors by maximizing $\left(R_g - \sqrt{1 - (\rho_g^*)^2}\right) / \rho_g^*$.
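The four observations above can be illustrated with a small sketch that scores regions directly from their $(R_g, \rho_g^*)$ pairs. It uses only the Python standard library; the helper names and the three example regions are hypothetical and serve only to show the ordering that the utility induces.

```python
import math
from statistics import NormalDist

_STD = NormalDist()


def utility(theta, r, rho):
    """Utility written in terms of R_g and rho_g, as in the text above."""
    if not 0.5 < theta < 1.0:
        raise ValueError("this sketch assumes 0.5 < theta < 1")
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    score = (r - math.sqrt(1.0 - rho * rho)) / rho
    return _STD.cdf(_STD.inv_cdf(theta) * score)


def pick_region(theta, regions):
    """Return the name of the region with the highest utility.

    regions maps a region name to a pair (R_g, best achievable rho_g).
    """
    return max(regions, key=lambda name: utility(theta, *regions[name]))


if __name__ == "__main__":
    # Hypothetical regions, none of which currently carries a reward (R < 1).
    regions = {
        "A": (0.6, 0.5),  # same explorability as B, less likely to match
        "B": (0.8, 0.5),
        "C": (0.8, 0.3),  # same R as B, less explorable
    }
    print(pick_region(0.7, regions))
```

In this toy configuration region B is preferred to A (observation 2) and to C (observation 3).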


7.4  Empirical  evaluation 

We now turn to an empirical evaluation of our framework, in three different settings and with
three different classifiers. Code and data for these experiments are available online.²

Precision plots are available in the appendix of Y. Ma, Sutherland, et al. (2015) for completeness.
Precision is determined primarily by the classifier and $\theta$, and thus does not vary much
across methods.


7.4.1  Environmental  monitoring  (linear  classifier) 

In order to analyze the performance of APPS with the mean threshold classifier, we ran it on a
real environmental monitoring dataset and compared to baseline algorithms. Valada et al. (2012)
used small (60 cm) autonomous fan-powered boats to collect dissolved oxygen (DO) readings in a
pond, with the goal of finding regions that are low in dissolved oxygen, an indicator of poor water
quality. The data used in our experiment comes from a pond approximately 150 meters wide and
50 meters long. The mobile robots have a cell-phone module that records the time and location of
every measurement. Because of physical limitations, the measurement reading does not stabilize
for about one minute. Therefore, in data collection, the boat was moved back and forth in a single
location, in the hope that the noise would cancel by averaging these measurements.

² https://github.com/AutonlabCMU/ActivePatternSearch/


Figure 7.1: Illustration of dataset and APPS selections for one run. (a) Data and true matching
regions (black). (b) Data collected by APPS and posterior region probability. A point marks the
location of a measurement whose value is also reflected in its color. Every grid box is a region
whose possibility of matching is reflected in grayscale.


In order to verify our methods, we borrowed data from Valada et al. (2012), comprising 16 960
location/DO value pairs, and fit a GP model by maximizing the likelihood of the prior parameters
on 500 random samples seven times, taking the median of the learned hyperparameter values. We
used a squared-exponential kernel with a learned length scale. We defined regions by covering
the map with many windows of size comparable to the GP length scale, and used parameters
$b = -9$, $c = -100$. Data points and classifier probability outputs for the ground truth are shown
in Figure 7.1a, which also shows the learned length scale (roughly 3 meters).

We then repeated the following experiment: we randomly sampled 6 000 points at a time
from data points not used for GP parameter training, and randomly selected 10 of these 6 000
points to form an initial training set $\mathcal{D}$. We then used several competing methods to sequentially
make further queries until 300 total observations were obtained. The considered algorithms
were: APPS with analytical solutions, APPS with one draw of $z_*$ at each candidate location, AAS
(Y. Ma, Garnett, et al. 2014) with analytical solutions, AAS with sampling, the level set estimation
(LSE) algorithm of Gotovos et al. (2013) with parameters $\beta_t = 6.25$ and $\epsilon = 0.1$, uncertainty
sampling (UNC), and random selection (RAND). Each algorithm chose queries based on its own
criterion; the quality of queried points was evaluated by the mean threshold classifier with the
above parameters and was then compared with true region labels that were computed by the mean
threshold classifier using all 6 000 data points. A 70% marginal probability was chosen to be
required for a region to be classified as matching ($\theta = 0.7$).
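The evaluation criterion can be sketched in a few lines. The snippet below is an illustrative assumption about how recall and precision of flagged regions might be computed at a single point in a run, given each region's marginal probability $T_g(\mathcal{D})$ and its true label; it does not model flags persisting over time, and the function name and toy numbers are hypothetical.

```python
def recall_precision(marginals, true_labels, theta=0.7):
    """Recall and precision of regions flagged as matching.

    marginals   : per-region marginal probabilities T_g(D)
    true_labels : per-region booleans from the classifier run on all data
    theta       : probability threshold required to flag a region
    """
    flagged = [t > theta for t in marginals]
    true_pos = sum(1 for f, y in zip(flagged, true_labels) if f and y)
    n_true = sum(1 for y in true_labels if y)
    n_flagged = sum(1 for f in flagged if f)
    recall = true_pos / n_true if n_true else float("nan")
    precision = true_pos / n_flagged if n_flagged else float("nan")
    return recall, precision


if __name__ == "__main__":
    marginals = [0.90, 0.75, 0.40, 0.20]
    labels = [True, False, True, False]
    print(recall_precision(marginals, labels))
```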

Figure 7.2a reports the mean and standard error of the recall of matching regions over
15 repetitions of this experiment. APPS and AAS with both analytical solutions and sampling
performed equally well here. The similarity between APPS and AAS is also expected because in
linear problems, the choice of $c$ (the only difference between the algorithms here) is relatively
minor. Notice that AAS is not able to handle any other classifier-based setting; this is the core
contribution of APPS. To understand why analytical solutions were similar to sampling, notice
that the data collection locations have to be constrained to those actually recorded, which makes
it easier to obtain a near-optimal decision.


Figure 7.2: Results for the pond monitoring experiment: (a) recall curves; (b) precision curves.
Color bands show standard errors after 15 runs.


The second group in performance ranking is the LSE method. We attempted to boost its
performance by selecting its parameters to directly optimize the area under its recall curve, which
was, in a sense, cheating. On further analysis of its query decisions, we saw LSE making, for
the most part, qualitatively similar selection decisions to APPS. LSE will stop collecting data in
a region if there is enough confidence, but does not specifically try to push regions over the
threshold, and so its performance on this objective is inferior.

Last in the comparison are RAND and UNC. It is interesting to observe that RAND was initially
better than UNC, but was later overtaken by it. In the beginning, since UNC is purely explorative, its
reward uniformly remained low across multiple runs, whereas in some runs RAND queries can be
lucky enough to concentrate around matching regions. At a later phase, RAND faces the coupon
collector’s problem and may select redundant boring observations, whereas UNC keeps making
progress at a constant rate.

Figure 7.2b shows results for precision. Sampling-based methods for APPS and AAS had
lower precision than analytical ones, because the noise of sampling makes it more likely for a
region to be flagged “accidentally,” and such a flag then persists.


7.4.2  Predicting  election  results  (linear  classifier) 

Consider the problem of a state-level political party official who wishes to determine which races
will be won, lost, or might go either way. As surveying likely voters is relatively expensive, we
would like to do so with as few surveys as possible.

In a simple model of this problem, the problem of finding races which will be won is a natural
fit to a classifier of the form $h_g(f) = \Phi(w_g^\top f(\Xi_g) + b_g)$. Our function $f$ maps from the voting
precincts in the state to the vote share of a given party in that district, with a covariance kernel
defined by demographic similarity and geographic proximity. To account for multiple races taking
place in each district (e.g., state and national legislators), we duplicate each precinct with a flag
for the type of election. If $\Xi_g$ is the set of all precincts participating in a particular race and $w_g$
is some constant $c$ times the voting population of each precinct, then $w_g^\top f(\Xi_g)$ gives $c$ times the
total vote portion for the given party in that election. In a simple model which ignores turnout
effects, the probability of winning a race is essentially 1 if the underlying proportion is greater
than 0.5 and 0 otherwise; this can be accomplished by setting $c$ to some fairly large constant,
say 100, and $b = -\tfrac{1}{2} c$. (An equally simple model that nonetheless more thoroughly accounts for
unmodeled effects would just use a smaller value of $c$.)
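As a small illustration of this classifier, the sketch below computes the win probability for a single race from precinct-level vote shares. It assumes, as one reading of the text, that the weights are $c$ times each precinct's share of the race's voting population (so that the weighted sum is $c$ times the overall vote share) and that $b = -c/2$; the function name and the toy numbers are hypothetical.

```python
from statistics import NormalDist


def win_probability(vote_shares, populations, c=100.0):
    """Probit-linear win probability for one race.

    vote_shares : party vote share in each precinct (values in [0, 1])
    populations : voting population of each precinct
    c           : sharpness constant; large c approaches a hard 0.5 threshold
    """
    total = float(sum(populations))
    # Assumption: weights are c times the population *share* of each precinct.
    weights = [c * p / total for p in populations]
    offset = -0.5 * c
    score = sum(w * s for w, s in zip(weights, vote_shares)) + offset
    return NormalDist().cdf(score)


if __name__ == "__main__":
    shares = [0.55, 0.49]
    pops = [3000, 1000]           # overall share is 0.535
    print(win_probability(shares, pops))           # close to 1 for c = 100
    print(win_probability(shares, pops, c=10.0))   # softer with smaller c
```

With a smaller `c` the same overall share gives a probability much closer to one half, which is how the parenthetical remark about unmodeled effects plays out numerically.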

We  ran  experiments  based  on  this  model  on  2010  Pennsylvania  election  returns  (Ansolabehere 
and  Rodden  2011).  For  each  voting  precinct  in  the  dataset,  we  used  the  2010  Decennial  Census 
(United  States  Census  Bureau  2010)  to  obtain  a  total  population  count  and  percentages  of  the 
population  for  gender,  race,  age,  and  housing  type  categories;  we  also  added  an  (x,  y)  location 
based  on  a  Lambert  conformal  conic  projection  of  a  point  in  the  precinct,  and  used  these  features 
in  a  squared-exponential  kernel.  The  data  for  each  precinct  was  then  replicated  three  times 
and  associated  with  Democratic  vote  shares  for  its  U.S.  House  of  Representatives,  Pennsylvania 
House  of  Representatives,  and  Pennsylvania  State  Senate  races;  the  demographic/geographic 
kernel  was  multiplied  by  a  positive-definite  covariance  matrix  amongst  the  races.  We  learned  the 
hyperparameters  for  this  kernel  by  maximizing  the  likelihood  of  the  model  on  full  2008  election 
data. 

Given the kernel, we set up experiments to predict 2010 races based on surveying an individual
voting precinct at a time. For simplicity, we assume that a given voting precinct can be thoroughly
surveyed (and ignore turnout effects, voters changing their minds over time, and so on); thus
observations were made with the true vote share. We seeded the experiment with a random 10
(out of 16 226) districts observed; APPS selected from a random subset of 100 proposals at each
step. We again used $\theta = 0.7$.


Figure 7.3: Results for election prediction. Color bands show standard errors over 15 runs.

Figure 7.3a shows the mean and standard errors of recalls over 15 runs for APPS, UNC, and
RAND. LSE and AAS are not applicable to this problem, as they have no notion of weighting points
(by population). APPS outperforms both random and uncertainty sampling here, though in this
case the margin over random sampling is much narrower. This is probably because the portion
of regions which are positive in this problem is much higher, so more points are informative.
Uncertainty sampling is in fact worse than random here, which is not too surprising because
the purely explorative nature of UNC is even worse on the high-dimensional input space of this
problem.

7.4.3  Finding  vortices  (black-box  classifier) 

The problem we consider here requires more complex pattern classifiers. We study the task
of identifying vortices in a vector field based on limited observations of flow vectors. Linear
classifiers are insufficient for this problem,³ so we will demonstrate the flexibility of our approach
with a black-box classifier.

To illustrate this setting, we consider the results of a large-scale simulation of a turbulent fluid
in three dimensions over time in the Johns Hopkins Turbulence Databases⁴ (Perlman et al. 2007).
Following Sutherland, Xiong, et al. (2012), we aim to recognize vortices in two-dimensional
slices of the data at a single timestep, based on the same small training set of 11 vortices and 20
non-vortices, partially shown in Figure 7.4.

Recall that $h_g$ assigns probability estimates to the entire function class $\mathcal{F}$ confined to region $g$.
Unlike the previous examples, it is insufficient to consider only a weighted integral of $f$. We can
consider the average flow across sectors (angular slices from the center) of our region as building
blocks in detecting vortices. We count how many sectors have clockwise/counter-clockwise flows
to give a classification result, in three steps:

1. First, we divide a region into $K$ sectors. In each sector, we take the integral of the inner
product between the actual flow vectors and a template. The template is an “ideal” vortex,
but with larger weights in the center than the periphery. This produces a $K$-dimensional
summary statistic $L_g(f)$ for each region.

2. Next, we improve robustness against different flow speeds in the data by scaling $L_g(f)$ to
have maximum entry 1, and flip its sign if its mean is negative. Call the result $\tilde{L}_g(f)$.

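3. Finally, we feed the normalized $\tilde{L}_g(f)$ vector through a 2-layer neural network of the form

$$h_g(f) = \sigma\left( w_{\text{out}}^\top \, \sigma\left( W_{\text{in}} \tilde{L}_g(f) + b_{\text{in}} \right) + b_{\text{out}} \right),$$

where $\sigma$ is the logistic sigmoid function.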

Because $L_g$, which is effectively taking the $L_2$ inner product with $K$ fixed template functions,
is a linear operator, $L_g(f) \mid \mathcal{D}$ obeys a $K$-dimensional multivariate normal distribution. We
sample many possible $L_g(f)$ from that distribution, which we then normalize and pass through
the neural network as described above. This gives samples of probabilities $h_g$, whose mean is a
Monte Carlo estimate of (7.1).

³ The set of vortices is not convex: consider the midpoint between a clockwise vortex and its identical
counter-clockwise case.

⁴ http://turbulence.pha.jhu.edu

Figure 7.4: (a) Positive (top) and negative (bottom) training examples for the vortex classifier.
(b) The velocity field used; each arrow is the average of a 2 × 2 square of actual data points.
Background color shows the probability obtained by each region classifier on the 200 circled
points; red circles mark points selected by one run of APPS initialized at the green circles.
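The Monte Carlo estimate described above (sample the summary statistic from its multivariate normal posterior, normalize, apply the network, average) can be sketched as follows using only the Python standard library. Everything specific here is an assumption made for illustration: the posterior mean and covariance, the network weights, the reading of "maximum entry 1" as scaling by the largest absolute entry, and all function names.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def cholesky(a):
    """Lower-triangular Cholesky factor of a small positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def normalize(stat):
    """Scale so the largest absolute entry is 1; flip sign if the mean is negative."""
    scale = max(abs(v) for v in stat)
    if scale == 0.0:
        return list(stat)
    out = [v / scale for v in stat]
    return [-v for v in out] if sum(out) < 0 else out


def network(x, w_in, b_in, w_out, b_out):
    """Two-layer network with logistic sigmoid activations."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w_in, b_in)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


def mc_salient_probability(mean, cov, params, n_samples=2000, seed=0):
    """Average classifier output over samples of the summary statistic."""
    rng = random.Random(seed)
    low = cholesky(cov)
    k = len(mean)
    total = 0.0
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(k)]
        sample = [mean[i] + sum(low[i][j] * eps[j] for j in range(i + 1))
                  for i in range(k)]
        total += network(normalize(sample), *params)
    return total / n_samples


if __name__ == "__main__":
    # Hypothetical posterior over K = 4 sector statistics and hypothetical weights.
    mean = [1.0, 0.8, 1.2, 0.9]
    cov = [[0.25 if i == j else 0.05 for j in range(4)] for i in range(4)]
    params = ([[2.0, 2.0, 2.0, 2.0], [-1.0, 1.0, -1.0, 1.0]], [-4.0, 0.0],
              [3.0, -1.0], -1.0)
    print(mc_salient_probability(mean, cov, params))
```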


We used $K = 4$ sectors, and the weights in the template were fixed such that the length scale
matches the distance from the center to an edge. The network was optimized for classification
accuracy on the training set. We then identified a 50 × 50-pixel slice of the data that contains two
vortices, some other “interesting” regions, and some “boring” regions, mostly overlapping with
Figure 11 of Sutherland, Xiong, et al. (2012); the region, along with the output of the classifier
when given all of the input points, is shown in Figure 7.4b. We then ran APPS, initialized with 10
uniformly random points, for 200 steps. We defined the regions to be squares of size 11 × 11 and
spaced them every 2 points along the grid, for 400 total regions. We again thresholded at $\theta = 0.7$.
We evaluated (7.1) via a Monte Carlo approximation, as in the general form of the algorithm in
Section 7.3: first, we picked 80 random candidate locations $x_*$. For each $x_*$, we took 4 samples of
$z_*$. For each $z_*$, we obtained the posterior of $f$ over the evaluation window, and evaluated $h_g$ on
15 different samples from that posterior.

Figure 7.5a shows recall curves of APPS, uncertainty sampling (UNC), and random selection
(RAND), where for the purpose of these curves we call the true label the output of the classifier
when all data is known, and the proposed label is true if $T_g > \theta$ at that point of the search
(evaluated using more Monte Carlo samples than in the search process, to gain assurance in our
evaluation but without increasing the time required for the search). We can see that active pattern
search substantially outperforms uncertainty sampling and random selection. It is interesting to
observe that RAND was initially better than UNC, but was later overtaken by it. In the beginning, since
UNC is purely explorative, its reward uniformly remained low across multiple runs, whereas in
some runs RAND queries can be lucky enough to concentrate around matching regions. At a later
phase, RAND faces the coupon collector’s problem and may select redundant boring observations,
whereas UNC keeps making progress at a constant rate.

Figure 7.5: Results for the vortex experiment. Color bands show standard errors over 15 runs.


Chapter 8

Conclusions  and  future  directions 


If there is but a single take-away message from this thesis, it is perhaps this: in any machine
learning problem, it is vital to consider how to model your data. Sets and distributions are a
flexible choice that covers many use cases, and can be applied to many different problem areas, as
seen in Chapters 1 and 5.

Random feature embeddings are also an important tool for scalable learning, but when using
random Fourier features make sure to use the right variant (Chapter 3). They have important
advantages over the Nyström approach in ease of distributing across multiple machines and of
integration into deep learning settings; their relative performance depends on the problem setting,
but Nyström embeddings with approximate leverage scores seem promising (T. Yang et al. 2012;
El Alaoui and Mahoney 2015; Rudi et al. 2016). For distribution learning, the MMD embedding
based on random Fourier features for the Gaussian RBF kernel (Section 4.1) is very simple and
typically performs well, though in some cases the HDD embedding of Section 4.3 may be better.

For hard problems, learning more complex kernels is extremely important. We have proposed
a new method for doing so in the setting of two-sample testing in Chapter 6, and its
integration with deep learning to learn very powerful kernels is quite promising. For general
learning on distributions, integration with deep networks as proposed in Section 8.2 is a natural
way forward that still needs more study.

Active  learning  is  also  an  important  problem  with  many  real-world  applications.  Chapter  7 
gave  an  algorithm  for  the  particular  problem  of  active  pointillistic  pattern  search,  which  is  of  a 
similar  flavor  to  learning  on  distributions.  Section  8.4  discusses  some  approaches  to  true  active 
learning  on  distributions. 

How to solve a new distribution learning problem. Given a new problem that can reasonably
be phrased as a distribution learning problem, the “default” choice should probably be MMD
based on the Gaussian RBF kernel, which is the simplest approach that has shown empirical
success in a variety of areas and is supported by theory (Szabó et al. 2015); either the pairwise
estimator or the random Fourier feature embedding is fine, though for either large numbers of
distributions or for many samples from each distribution the embedding is quite helpful. Tuning
the kernel bandwidth is important, and should probably be done by cross-validation on the final
learning performance.

If performance there is not satisfactory, for moderate sample sizes pairwise estimators of
other distributional distances (as in Chapter 2) may work better; for larger sample sizes with
low-dimensional distributions, the embedding of Section 4.3 is sometimes preferable. With
high-dimensional distributions and large sample sizes, perhaps the dimensions can be treated
independently (as in Section 5.3.2), but otherwise these basic choices are exhausted.

If better performance is still required, the best choice is probably to explore integration of the
MMD embedding with deep learning, as in Section 8.2.


The  remainder  of  this  chapter  discusses  future  areas  to  explore. 


8.1  Deep  learning  of  kernels  for  two-sample  testing 

The test statistics of Chapter 6 are quite naturally suited to performing two-sample testing
problems with deep learning. Thorough evaluations of this approach to difficult two-sample
testing problems are underway now.

One prominent potential application is to the very popular framework of generative adversarial
networks (Goodfellow et al. 2014), in which a generator network attempts to create samples
that look like training samples by tricking an adversary network, which attempts to distinguish
generated samples from training samples. As simultaneously noted by Y. Li et al. (2015) and Dziugaite
et al. (2015), the adversary network can be thought of as performing a two-sample test between
a batch of generated samples and the training set, and so the adversary can be simply replaced
by an MMD test. Dziugaite et al. (2015) attempted to do so with a fixed Gaussian RBF kernel,
which performed poorly on generating images, because the kernel has a very poor understanding
of images. Y. Li et al. (2015) worked around this by (essentially) using MMD with the Gaussian
RBF kernel on the latent codes learned by a fixed autoencoder instead, and got much better results.
We may be able to do better, however, by using an adversary based on an MMD test with a kernel
learned (via the criteria of Chapter 6) for the particular comparison at hand. This method
will be fully adaptive to whatever the generator network chooses to create, rather than relying on
the fixed autoencoder-based kernel as in Y. Li et al. (2015).
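For concreteness, the quantity that such an MMD-based adversary would evaluate on a batch can be sketched as below. This is a minimal standard-library illustration of the usual unbiased pairwise estimator of squared MMD with a fixed Gaussian RBF kernel, not the learned-kernel criterion of Chapter 6; the function names, the bandwidth, and the toy one-dimensional samples are assumptions made for the example.

```python
import math
import random


def gaussian_kernel(x, y, sigma):
    """Gaussian RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_unbiased(xs, ys, sigma=1.0):
    """Unbiased pairwise estimate of squared MMD between two samples."""
    m, n = len(xs), len(ys)
    k_xx = sum(gaussian_kernel(xs[i], xs[j], sigma)
               for i in range(m) for j in range(m) if i != j) / (m * (m - 1))
    k_yy = sum(gaussian_kernel(ys[i], ys[j], sigma)
               for i in range(n) for j in range(n) if i != j) / (n * (n - 1))
    k_xy = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys) / (m * n)
    return k_xx + k_yy - 2.0 * k_xy


if __name__ == "__main__":
    rng = random.Random(0)
    real = [[rng.gauss(0.0, 1.0)] for _ in range(150)]
    good_fake = [[rng.gauss(0.0, 1.0)] for _ in range(150)]   # same distribution
    bad_fake = [[rng.gauss(2.0, 1.0)] for _ in range(150)]    # shifted mean
    print(mmd2_unbiased(real, good_fake), mmd2_unbiased(real, bad_fake))
```

A generator trained against this statistic would try to drive the estimate toward zero; the pairwise form costs time quadratic in the batch size.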


8.2  Deep  learning  of  kernels  for  distribution  learning 

Manually  specifying  featurizations  and  kernels  can  be  an  arduous  task,  especially  for  those 
inexperienced  with  the  precise  methods  in  use.  In  certain  problems  in  computer  vision,  even  years 
of  extremely  active  development  on  different  human-designed  featurizations  have  not  matched 
the  performance  of  learned  features.  The  further  adoption  of  distribution  learning  would  benefit 
greatly  from  integration  with  representation  learning  techniques. 

In Chapter 6, we explored the automated learning of complex kernels for two-sample testing
problems, primarily using pairwise kernel estimators. Similar techniques could be
applicable to regression and classification tasks, and indeed Yoshikawa et al. (2014, 2015) use
techniques that could be viewed as being along these lines. When there are many distributions
to compare, however, rather than just the two of a two-sample testing problem, that task is more difficult.

The embeddings discussed in Chapter 4, however, provide a natural means to use deep learning
techniques in distribution learning, and vice versa: given inputs $\{x_1, \dots, x_n\}$, simply compute a
deep representation $\{f(x_1), \dots, f(x_n)\}$ convolutionally, then pass through $z(\{f(x_1), \dots, f(x_n)\})$
before performing the learning task. The MMD embedding of Section 4.1 and the $L_2$ embedding
of Section 4.2 are both easily differentiable (depending on the choice of kernel or basis functions),
and simple to implement within deep learning frameworks. The HDD embeddings of Section 4.3
would be more complex, though possible, to use in this manner.
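A rough sketch of such an embedding layer, for the MMD case with a Gaussian RBF kernel, is given below. It uses the paired cosine/sine form of random Fourier features and averages the features over the set; the representation `feature_map` is a placeholder for the deep network (here the identity), and all names and sizes are assumptions for illustration. It is written with the Python standard library only, so it shows the computation rather than a differentiable implementation.

```python
import math
import random


def sample_frequencies(dim, n_freq, sigma, seed=0):
    """Draw frequencies omega ~ N(0, I / sigma^2) for a Gaussian RBF kernel."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
            for _ in range(n_freq)]


def rff(x, freqs):
    """Paired cos/sin random Fourier features; rff(x) . rff(y) ~ k(x, y)."""
    scale = 1.0 / math.sqrt(len(freqs))
    proj = [sum(w * v for w, v in zip(omega, x)) for omega in freqs]
    return [scale * math.cos(p) for p in proj] + [scale * math.sin(p) for p in proj]


def embed_set(points, freqs, feature_map=lambda x: x):
    """Mean embedding z({f(x_1), ..., f(x_n)}): average the features over the set."""
    feats = [rff(feature_map(x), freqs) for x in points]
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    freqs = sample_frequencies(dim=2, n_freq=2000, sigma=1.0)
    x, y = [0.3, -0.2], [0.8, 0.5]
    exact = math.exp(-((0.3 - 0.8) ** 2 + (-0.2 - 0.5) ** 2) / 2.0)
    print(dot(rff(x, freqs), rff(y, freqs)), exact)
    print(embed_set([x, y], freqs)[:3])
```

Inner products of two set embeddings then approximate the mean map kernel between the corresponding sample sets, and the squared distance between them approximates the squared MMD.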

8.2.1  Integration  with  deep  computer  vision  models 

In  Section  5.3.2,  we  considered  using  the  features  learned  by  a  standard  convolutional  deep 
network  as  samples  from  an  image-level  distribution  of  local  features,  and  classified  images 
based  on  those  sets  of  features.  Here  features  are  trained  using  fully-connected  final  layers  as  the 
learning  model,  but  then  used  in  a  separate  distributional  kernel  model.  We  can  instead  make  a 
coherent  model  which  combines  feature  extraction  with  a  learning  model  based  on  a  distributional 
kernel,  by  making  a  distributional  embedding  layer  in  the  network. 

In  fact,  the  “network  in  network”  architecture  of  M.  Lin  et  al.  ( ’Of  )  popularized  the  idea  of 
replacing  the  late-layer  fully-connected  layers  of  AlexNet-type  models  (Krizhevsky  et  al.  2012) 
with  global  average  pooling,  treating  each  convolutional  filter  as  providing  a  score  for  a  given 
class  label  and  aggregating  with  the  mean.  Szegedy  et  al.  (2014)  later  adopted  this  idea,  though 
they  added  a  layer  after  the  average  pooling  in  an  attempt  to  ease  cross-task  adaptation.  In  the 
distributional  framework,  we  can  think  of  this  now  as  a  classifier  based  on  a  linear-kernel  mmd 
embedding. 
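The correspondence between global average pooling and a linear-kernel MMD embedding can be seen in a few lines. In the sketch below (standard-library Python, with hypothetical function names and toy feature sets), the embedding of a set is simply its feature mean, so the resulting squared MMD is the squared distance between means and is blind to any other difference between the sets.

```python
def mean_embedding(features):
    """Linear-kernel mean embedding: the per-dimension average (global average pooling)."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def linear_mmd2(xs, ys):
    """Squared MMD under the linear kernel: squared distance between the means."""
    mx, my = mean_embedding(xs), mean_embedding(ys)
    return sum((a - b) ** 2 for a, b in zip(mx, my))


if __name__ == "__main__":
    narrow = [[-1.0, 0.0], [1.0, 0.0]]
    wide = [[-3.0, 0.0], [3.0, 0.0]]      # same mean, larger spread
    shifted = [[2.0, 4.0], [4.0, 4.0]]    # different mean
    print(linear_mmd2(narrow, wide), linear_mmd2(narrow, shifted))
```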

Linear-kernel  mmd,  however,  compares  distributions  based  only  on  their  mean.  By  using  e.g. 
random  Fourier  features  for  a  Gaussian  rbf  kernel,  we  can  derive  a  richer  classifier  structure. 
We  conducted  an  initial  exploration  of  this  approach  in  J.  B.  Oliva,  Sutherland,  et  al.  (2015), 
taking  networks  initially  trained  on  ImageNet  (Russakovsky  et  al.  20  U  )  and  adapting  them  to 
other  classification  tasks:  Flickr  Style  (Karayev  et  al.  2013),  Wikipaintings  (Karayev  et  al.  2013), 
and  Places  (Zhou  et  al.  1).  We  used  both  AlexNet  and  GoogLeNet  architectures,  either 
replacing  or  augmenting  the  final  classification  layers  with  random  Fourier  feature  Gaussian  rbf 
embeddings,  and  found  small  but  consistent  improvements  in  classification  accuracies  compared 
to  adapting  the  original  model.  For  details,  see  J.  B.  Oliva,  Sutherland,  et  al.  (2015). 

It  seems  plausible  that  the  reason  the  improvements  here  were  not  as  large  as  we  might  have 
hoped  is  that  we  were  fine-tuning  features  initially  found  to  work  for  the  existing  architecture, 
whereas  the  optimal  features  to  use  when  making  full  distributional  comparisons  are  probably 
somewhat  different.  Thus,  training  the  distributional  variants  of  the  network  from  scratch  and 
perhaps  varying  the  earlier  architecture  of  the  network  as  well  would  be  required  to  fully  realize 
the  potential  of  this  work.  We  leave  this  time-consuming  process  to  future  work. 

8.2.2 Other parameterizations for kernel learning

In addition to learning the mapping $f(x)$ used before the kernel, one can also consider learning
the kernel itself. When using a random Fourier feature-based approach, learning the bandwidth
of the kernel is simple: sample $\omega_i \sim \mathcal{N}(0, I_d)$ and then scale the inputs by $1/\sigma$, perhaps with $\sigma$
parameterized as $\sigma = \exp(s)$. However, one can also consider learning the values $\omega_i$ themselves,
learning the kernel via its Fourier transform. This was evaluated in J. B. Oliva, Sutherland, et al.
(2015).

This parameterization, however, might not be the best way to learn the kernel. Z. Yang et al.
(2015) use the Fastfood approximation (Le et al. 2013) to random Fourier features and learn only
certain parts of the spectral representation, rather than directly adjusting the frequencies. This
may result in a nicer optimization surface.


8.3  Word  and  document  embeddings  as  distributions 

Until  recently,  much  work  in  natural  language  processing  treated  words  as  unique  symbols,  e.g. 
with  “one-hot”  vectors,  where  the  ith  word  from  a  vocabulary  of  size  V  is  represented  as  a 
vector with ith component 1 and all other components 0. It has recently become widely accepted
that  applications  can  benefit  from  richer  word  embeddings  which  take  into  account  the  similarity 
between  distinct  words,  and  much  work  has  been  done  on  dense  word  embeddings  so  that  distances 
or  inner  products  between  word  embeddings  represent  word  similarity  in  some  way  (e.g.  Collobert 
and  Weston  2008;  Turian  et  al.  2010;  Mikolov  et  al.  2013).  These  embeddings  can  be  learned  in 
various  ways,  but  often  involve  optimizing  the  representation’s  performance  in  some  supervised 
learning  task. 


Document  representations  First,  it  is  worth  noting  that  although  this  breaks  the  traditional 
“bag  of  words”  text  model  (where  documents  can  be  represented  simply  by  the  sum  of  the  words’ 
one-hot  encodings),  we  can  represent  documents  by  viewing  them  as  sample  sets  of  word  vectors. 

Kusner et al. (2015) recently adopted this model, using kNN classifiers based on the Earth
Mover’s Distance (emd) between documents, and obtained excellent empirical results. emd,
however, is expensive to compute even for each pair of documents when the vocabulary is large,
and additionally must be computed pairwise between documents; an approximate embedding in
the style of Chapter 4 is not known.

Yoshikawa  et  al.  (2014),  in  their  empirical  results,  considered  this  model  with  MMD-based 
kernels  (but  computing  pairwise  kernel  values  rather  than  approximate  embeddings).  Their 
main  contribution,  however,  is  to  optimize  the  word  embedding  vectors  for  final  classification 
performance;  by  doing  so  with  random  initializations,  they  saw  mild  performance  improvements 
over  mmd  kernels  using  substantially  less  training  data  for  the  embeddings  but  at  much  higher 
computational  cost.  Yoshikawa  et  al.  (2015)  extend  the  approach  to  Gaussian  process  regression 
models, but do not compare to separately-learned word embeddings.

Because of the limited empirical evaluation, particularly on larger datasets, it is currently
unclear how these methods compare to one another or to other approaches for document
representation. Additionally, perhaps fine-tuning existing word embeddings learned on a standard
dataset simultaneously with learning the regression or classification model for a particular
application, as is common in deep learning models for computer vision, would provide additional
power.

Richer word representation Embedding words as a single vector does not allow for as rich
a word representation as we might wish. Vilnis and McCallum (2015) embed words instead as
Gaussian distributions, and use the kl divergence between word embeddings to measure asymmetric
hypernym relationships: for example, their embedding for the word Bach is “included” in
their  embeddings  for  famous  and  man,  and  mostly  included  in  composer.  Gaussian  distributions, 
of  course,  are  still  fairly  limiting;  for  example,  a  multimodal  embedding  might  be  able  to  capture 
word  sense  ambiguity,  whereas  a  Gaussian  embedding  would  be  forced  to  attempt  to  combine 
both  senses  in  a  single  broad  embedding. 

We can thus consider richer, nonparametric classes of word embeddings: perhaps by
representing a word as a (possibly weighted) set of latent vectors. Comparisons could then be
performed either with an MMD-based kernel, when symmetry is desired, or with kl estimators (or
similar) when not.
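
To make the symmetric comparison concrete, here is a minimal sketch of a squared mmd between two weighted sets of latent vectors, using a Gaussian rbf base kernel and the biased (V-statistic) estimator. The function names, the choice of base kernel, and the toy data are assumptions for illustration only.

```python
import numpy as np

def gaussian_gram(X, Y, sigma):
    """Gaussian rbf kernel matrix between the rows of X and the rows of Y."""
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-np.maximum(sq, 0) / (2 * sigma ** 2))

def weighted_mmd2(X, wx, Y, wy, sigma=1.0):
    """Squared mmd between the weighted point sets (X, wx) and (Y, wy).

    Weights are normalized to sum to one, so each set is treated as a
    discrete distribution over its latent vectors.
    """
    wx = np.asarray(wx, float) / np.sum(wx)
    wy = np.asarray(wy, float) / np.sum(wy)
    return (wx @ gaussian_gram(X, X, sigma) @ wx
            + wy @ gaussian_gram(Y, Y, sigma) @ wy
            - 2 * wx @ gaussian_gram(X, Y, sigma) @ wy)

rng = np.random.default_rng(0)
vecs_a = rng.standard_normal((4, 5))   # hypothetical latent vectors for one word
vecs_b = vecs_a + 3.0                  # a shifted set standing in for another word
w = np.ones(4)

same = weighted_mmd2(vecs_a, w, vecs_a, w)
diff = weighted_mmd2(vecs_a, w, vecs_b, w)
swapped = weighted_mmd2(vecs_b, w, vecs_a, w)
```

The quantity is symmetric in its two arguments by construction, which is the property that distinguishes it from a kl-based comparison.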

One approach would be to choose these vectors arbitrarily, optimizing them for the output of
some learning problem: this would be implementationally similar to the approach of Yoshikawa
et al. (2014, 2015) for mmd distances, or somewhat like that of Vilnis and McCallum (2015) but
with greater computational cost, and greater flexibility, for kl distances.

Another  approach  is  inspired  by  the  classic  distributional  hypothesis  of  Harris  (1954),  that  the 
semantics of words are characterized by the contexts in which they appear. Many word embedding
approaches can be viewed as matrix factorizations of a matrix $M$ with rows corresponding to
words, columns to some notion of context, and entries containing some measure of association
between the two; the factorization $M = W C^T$ then typically discards the matrix $C$ and uses
the  rows  of  W  as  word  vectors.  This  approach  is  sometimes  taken  explicitly;  interestingly,  the 
popular  method  of  Mikolov  et  al.  (2013)  can  be  seen  as  approximating  this  form  as  well  (Levy 
and  Goldberg  2014).  This  view  inspires  a  natural  alternative:  treat  each  word  as  the  sample 
set  of  contexts  in  which  it  appears,  representing  each  context  via  the  learned  context  vectors. 
This  is  perhaps  the  most  direct  instantiation  of  the  distributional  hypothesis:  compare  words  by 
comparing  the  distribution  of  contexts  in  which  they  appear. 


8.4  Active  learning  on  distributions 

Suppose  we  have  a  collection  of  distributions,  but  initially  we  have  very  few  samples  from 
each  distribution.  We  can  choose  to  take  additional  iid  observations,  but  doing  so  is  relatively 
expensive;  perhaps  it  requires  real-world  expenditure  of  time  or  resources  to  collect  samples, 
or  perhaps  these  distributions  are  available  only  through  computationally  intensive  numerical 
simulations.  We  may  wish  to  learn  a  classification  or  regression  function  mapping  from  these 
distributions  to  some  label  (similar  to  traditional  active  learning  settings),  to  locate  distributions 
which  follow  some  prespecified  pattern  (similar  to  the  setting  of  Chapter  7  with  independent 
regions),  or  to  find  the  distribution  which  is  “best”  in  some  sense  (as  in  pure-exploration  bandit 
problems,  Bubeck  et  al.  2010).  In  any  of  these  cases,  we  need  to  choose  some  selection  criterion 
that  will  appropriately  consider  the  utility  of  selecting  points  from  distributions,  a  problem  that 
is  related  to  but  certainly  distinct  from  typical  fully-observed  active  learning  models. 

In  the  dark  matter  prediction  experiments  of  Section  5.1,  we  assumed  that  each  observed 
galaxy  has  a  well-known  line-of-sight  velocity  estimated  via  redshift.  In  practice,  good  velocity 
estimates are available only through relatively expensive spectroscopic imaging; cheaper
few-color imaging techniques give extremely uncertain velocity estimates. We could simply ignore
the  imaging  estimates  and  apply  the  previous  model,  selecting  a  random  galaxy  from  each  halo 
to  perform  spectroscopy  upon.  It  would  probably  be  more  effective,  however,  to  consider  active 
learning  methods  that  begin  with  visual  imaging,  and  then  identify  which  objects  will  be  useful 
for  spectroscopy  in  order  to  best  identify  the  masses  of  their  dark  matter  halos.  One  modeling 
option  would  be  to  take  a  probability  distribution  over  the  sample  set,  and  then  identify  the 
resulting  distribution  of  the  mean  map  embedding  and  therefore  its  predicted  label  under  a 
learned  predictor;  we  would  then  identify  objects  to  observe  that  most  reduce  uncertainty  in  the 
predicted  label.  This  could  be  conducted  either  for  a  single  halo,  where  the  objective  is  to  best 
learn  its  mass,  or  across  multiple  halos,  where  the  objective  is  either  to  find  the  most  massive  halos 
(active  search)  or  to  reduce  some  form  of  overall  uncertainty  in  all  of  the  halo  mass  predictions 
(active  learning). 

Appendix A

The skl-groups package


Efficient implementations of several of the methods for learning on distributions discussed in
this thesis are available in the Python package skl-groups.¹ This package integrates with the
standard Python numerical ecosystem and presents an api compatible with that of scikit-learn
(Grisel et al. 2016).

The package is designed around the Transformer interface of scikit-learn, and is meant to work
with its pipelines as much as possible. In the scikit-learn api, which works almost exclusively
with Euclidean feature vectors, it is assumed in most places that features are represented as an
array of shape (n, d), where each row represents a feature vector. In skl-groups, each object
is a set of vectors. This is represented as a Python list (or numpy object array) of numeric
arrays, where each array is of shape (n_i, d): n_i can vary from element to element. Internally, most
methods convert these lists into Features objects, which provide convenient helpers to access
the data in a consistent way, and can optionally store any metadata associated with each element.

The class supports data storage either as a collection of separate numeric arrays, or as a
single stacked array of shape ($\sum_i n_i$, d), with views into the array for each feature set. This form
is convenient for more efficient memory access or for operations which operate pointwise (like
standardization).
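
The stacked layout can be illustrated with plain numpy. This is only a sketch of the idea, not the package's Features class: the helper name `stack_bags` and the toy data are made up for the example.

```python
import numpy as np

def stack_bags(bags):
    """Stack a list of (n_i, d) arrays into one (sum_i n_i, d) array.

    Also returns one view per bag; the views share memory with the
    stacked array, so pointwise in-place operations on the stacked
    array are visible through each bag.
    """
    n_pts = [len(b) for b in bags]
    stacked = np.vstack(bags)
    bounds = np.concatenate([[0], np.cumsum(n_pts)])
    views = [stacked[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]
    return stacked, views

rng = np.random.default_rng(0)
bags = [rng.standard_normal((n, 2)) * 5 + 1 for n in (4, 7, 3)]
stacked, views = stack_bags(bags)

# Standardize every point in place, across all bags at once.
stacked -= stacked.mean(axis=0)
stacked /= stacked.std(axis=0)
```

Basic slicing returns views rather than copies, which is why a single pass over the stacked array is enough to standardize every set.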

The skl-groups api can be divided into several sections:

Features  The  Features  class  discussed  above. 

Preprocessing A collection of utilities to normalize, scale, standardize, or run principal
components analysis on each set in a collection of features. These are wrappers around a class
BagPreprocesser, which helps apply transformers to each set, and the relevant scikit-learn
transformers.

Summaries  Methods  that  convert  sets  into  single  feature  vectors: 

• BagMean: Represents each set by its mean. Especially useful in conjunction with scikit-learn’s
RBFSampler to perform the mmd embedding of Section 4.1.

• BagOfWords: Quantizes each set into the bag of words representation.

• L2DensityTransformer: The L2 embedding of Section 4.2.

¹ https://github.com/dougalsutherland/skl-groups

Set kernels Currently contains only MeanMapKernel, which computes the pairwise mmd estimator.

Divergences KNNDivergenceEstimator, which can estimate $D_{\alpha,\beta}$ divergences based on
Poczos and Schneider (2011) and Poczos, Xiong, Sutherland, et al. (2012), the kl divergence based
on Q. Wang et al. (2009), and an estimator for the Jensen-Shannon divergence based on Hino and
Murata (2013).

Kernel  utilities  Utilities  to  turn  a  divergence  into  an  rbf  kernel,  as  well  as  the  psd  corrections 
of  Section  2.4.1. 
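
A rough sketch of what such utilities might do is given below. It is not the package's code: the function names are invented, symmetrizing by averaging is an assumption, and clipping negative eigenvalues is used here as one common psd correction, which may or may not be among those of Section 2.4.1.

```python
import numpy as np

def divergence_to_rbf(div, sigma):
    """Turn a matrix of pairwise divergence estimates into an rbf-style kernel.

    Estimates can be asymmetric or slightly negative, so they are
    symmetrized by averaging and clipped at zero before exponentiating.
    """
    sym = np.maximum((div + div.T) / 2, 0)
    return np.exp(-sym ** 2 / (2 * sigma ** 2))

def project_to_psd(K):
    """Project a symmetric matrix onto the psd cone by clipping eigenvalues."""
    vals, vecs = np.linalg.eigh((K + K.T) / 2)
    return (vecs * np.clip(vals, 0, None)) @ vecs.T

rng = np.random.default_rng(0)
div = np.abs(rng.standard_normal((6, 6)))   # stand-in for estimated divergences
np.fill_diagonal(div, 0)

K = project_to_psd(divergence_to_rbf(div, sigma=1.0))
min_eig = np.linalg.eigvalsh((K + K.T) / 2).min()
```

The correction matters because a kernel built from estimated divergences is not guaranteed to be psd, while most kernel learners assume that it is.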

Miscellaneous  utilities  Utilities  to  show  a  progress  bar  for  long-running  operations  like  the 
k-NN  divergence  estimator. 

Appendix B

Proofs  for  Chapter  3 


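B.1 Proof of Proposition 3.4

Part (i) is particularly simple:

\begin{align*}
\mathbb{E} \|\tilde f\|_\mu^2
  &= \mathbb{E} \int_{\mathcal{X}^2} \tilde f(x, y)^2 \, d\mu(x, y) \\
  &= \int_{\mathcal{X}^2} \mathbb{E} \tilde f(x, y)^2 \, d\mu(x, y) \tag{B.1} \\
  &= \int_{\mathcal{X}^2} \frac{1}{D} \left[ 1 + k(2x, 2y) - 2 k(x, y)^2 \right] d\mu(x, y),
\end{align*}

where (B.1) is justified by Tonelli’s theorem.

For part (ii), view $\|\tilde f\|_\mu^2$ as a function of $\omega_1, \dots, \omega_{D/2}$; then changing $\omega_i$ to a different $\hat\omega_i$
changes the value of $\|\tilde f\|_\mu^2$ by at most $\frac{16 D + 8}{D^2} \mu(\mathcal{X}^2)$ (as will be shown shortly). The first inequality
is thus a direct application of McDiarmid (1989); the second simply notes that

\[ \frac{D^2}{32 (2D + 1)^2} \ge \frac{1}{288}. \]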

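To show the claimed bounded deviation property, assume without loss of generality that we
replace $\omega_1$ by $\hat\omega_1$:

\begin{align*}
&\|\tilde f\|_\mu^2(\hat\omega_1, \omega_2, \dots, \omega_{D/2}) - \|\tilde f\|_\mu^2(\omega_1, \omega_2, \dots, \omega_{D/2}) \\
&= \int_{\mathcal{X}^2} \Big( \tfrac{2}{D} \cos(\hat\omega_1^T (x - y)) + \tfrac{2}{D} \sum_{i=2}^{D/2} \cos(\omega_i^T (x - y)) - k(x, y) \Big)^2 d\mu(x, y) \\
&\quad - \int_{\mathcal{X}^2} \Big( \tfrac{2}{D} \cos(\omega_1^T (x - y)) + \tfrac{2}{D} \sum_{i=2}^{D/2} \cos(\omega_i^T (x - y)) - k(x, y) \Big)^2 d\mu(x, y) \\
&= \frac{4}{D^2} \int_{\mathcal{X}^2} \cos^2(\hat\omega_1^T (x - y)) \, d\mu(x, y)
   + \int_{\mathcal{X}^2} \Big( \tfrac{2}{D} \sum_{i=2}^{D/2} \cos(\omega_i^T (x - y)) - k(x, y) \Big)^2 d\mu(x, y) \\
&\quad + 2 \int_{\mathcal{X}^2} \tfrac{2}{D} \cos(\hat\omega_1^T (x - y)) \Big( \tfrac{2}{D} \sum_{i=2}^{D/2} \cos(\omega_i^T (x - y)) - k(x, y) \Big) d\mu(x, y) \\
&\quad - \frac{4}{D^2} \int_{\mathcal{X}^2} \cos^2(\omega_1^T (x - y)) \, d\mu(x, y)
   - \int_{\mathcal{X}^2} \Big( \tfrac{2}{D} \sum_{i=2}^{D/2} \cos(\omega_i^T (x - y)) - k(x, y) \Big)^2 d\mu(x, y) \\
&\quad - 2 \int_{\mathcal{X}^2} \tfrac{2}{D} \cos(\omega_1^T (x - y)) \Big( \tfrac{2}{D} \sum_{i=2}^{D/2} \cos(\omega_i^T (x - y)) - k(x, y) \Big) d\mu(x, y) \\
&= \frac{4}{D^2} \int_{\mathcal{X}^2} \Big( \cos^2(\hat\omega_1^T (x - y)) - \cos^2(\omega_1^T (x - y)) \Big) d\mu(x, y) \\
&\quad + \frac{4}{D} \int_{\mathcal{X}^2} \Big( \cos(\hat\omega_1^T (x - y)) - \cos(\omega_1^T (x - y)) \Big) \Big( \tfrac{2}{D} \sum_{i=2}^{D/2} \cos(\omega_i^T (x - y)) - k(x, y) \Big) d\mu(x, y) \\
&\le \frac{4}{D^2} \int_{\mathcal{X}^2} 2 \, d\mu(x, y) + \frac{4}{D} \int_{\mathcal{X}^2} 4 \, d\mu(x, y) \\
&= \left( \frac{8}{D^2} + \frac{16}{D} \right) \mu(\mathcal{X}^2) \\
&= \frac{16 D + 8}{D^2} \mu(\mathcal{X}^2).
\end{align*}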


B.2  Proof  of  Proposition  3.5 


Part (i) is exactly analogous to that for $\tilde f$. Part (ii) is also quite similar:


WfWM’  0J2,  ■  ■  ■ ,  0JD/  2)  -  WffMl,  0J2,  ■  ■  ■,  C0D/ 2) 


/2  (cos(m}(jc  -  y))  +  cos (m}0  +  y)  +  2/q)) 

2  0,2  \2 

+  D  X  [cos(w/"(*  “  y))  +  cos(d>T(*  +  y)  +  2fc*)]  -  k(x,  y)  d/i(x,  y) 

1=2  / 

- /2  (cos(d)}0  -  y))  +  cos(d)}0  +  y)  +  2Z?i)J 


D/2 


D 


Y  [cos(wT(*  -  y))  +  cos (to](x  -  y)  +  2bi)\  -  k(x,  y)  dju(x,  y) 


(=2 


4 
4 


J(x  -  y))  +  cos(a>i(x  +  y)  +  2 b,)J  -  ^cos(d)}(.r  -  y))  +  cos(di}(x  +  y)  +  2b, jj  J  d/u(x. 


+  - 


D  Jx 2 


J(x  -  y))  +  cos(<u}(jc  +  y)  +  2b j)  -  cos (to\(x  ~  y))  -  cos(u)}(a:  +  y)  +  2 bj)\ 


<2  0/2 


D  Yj  “  y))  +  COS (ojJ(x  +  y)  +  2 b,)\  -  k(x,  y)  |  d/u(x,  y) 

\  i= 2 


-  if  f  8  >o  +  4  /*  4x3  y) 

D~  Jx2  D  Jx2 

B.3 Proof of Proposition 3.6


The proof strategy closely follows that of Rahimi and Recht (2007); we fill in some (important)
details, tightening some parts of the proof as we go.

Let $\mathcal{X}_\Delta = \{x - y \mid x, y \in \mathcal{X}\}$. It’s compact, with diameter at most $2\ell$, so we can find an $\epsilon$-net
covering $\mathcal{X}_\Delta$ with at most $T = (4\ell/r)^d$ balls of radius $r$ (Cucker and Smale 2001, Proposition 5).
Let $\{\Delta_i\}_{i=1}^T$ denote their centers, and $L_f$ be the Lipschitz constant of $\tilde f$. If $|\tilde f(\Delta_i)| < \epsilon/2$ for all $i$
and $L_f < \epsilon/(2r)$, then $|\tilde f(\Delta)| < \epsilon$ for all $\Delta \in \mathcal{X}_\Delta$.
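
The last implication can be spelled out in one line. The following is a short sketch of that step, where $\Delta_{i^*}$ (a label introduced only for this illustration) denotes a net center within distance $r$ of the point $\Delta$:

```latex
% Any \Delta \in \mathcal{X}_\Delta lies within r of some center \Delta_{i^*}.
\[
  |\tilde f(\Delta)|
  \le |\tilde f(\Delta_{i^*})| + |\tilde f(\Delta) - \tilde f(\Delta_{i^*})|
  \le |\tilde f(\Delta_{i^*})| + L_f \, \|\Delta - \Delta_{i^*}\|
  < \frac{\epsilon}{2} + \frac{\epsilon}{2r} \, r
  = \epsilon .
\]
```

The rest of the proof therefore only needs to control two events: a large error at some anchor point, and a large Lipschitz constant.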


Let $\tilde z_i(x) := [\sin(\omega_i^T x) \;\; \cos(\omega_i^T x)]^T$, so that
\[ \tilde z(x)^T \tilde z(y) = \frac{1}{D/2} \sum_{i=1}^{D/2} \tilde z_i(x)^T \tilde z_i(y). \]

B.3.1  Regularity  Condition 


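We will first need to establish that $\mathbb{E} \nabla \tilde s(\Delta) = \nabla \mathbb{E} \tilde s(\Delta) = \nabla k(\Delta)$. This can be proved via the
following form of the Leibniz rule, quoted verbatim from Cheng (2013):

Theorem (Cheng 2013, Theorem 2). Let $X$ be an open subset of $\mathbb{R}$, and $\Omega$ be a measure space.
Suppose $f : X \times \Omega \to \mathbb{R}$ satisfies the following conditions:

1. $f(x, \omega)$ is a Lebesgue-integrable function of $\omega$ for each $x \in X$.

2. For almost all $\omega \in \Omega$, the derivative $\frac{\partial f(x, \omega)}{\partial x}$ exists for all $x \in X$.

3. There is an integrable function $\Theta : \Omega \to \mathbb{R}$ such that $\left| \frac{\partial f(x, \omega)}{\partial x} \right| \le \Theta(\omega)$ for all $x \in X$.

Then for all $x \in X$,
\[ \frac{d}{dx} \int_\Omega f(x, \omega) \, d\omega = \int_\Omega \frac{\partial}{\partial x} f(x, \omega) \, d\omega. \]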


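Define the function $\tilde g^i_{x,y} : \mathbb{R} \times \Omega \to \mathbb{R}$ by $\tilde g^i_{x,y}(t, \omega) = \tilde s_\omega(x + t e_i, y)$, where $e_i$ is the $i$th standard
basis vector, and $\omega$ is the tuple of all the $\omega_j$ used in $\tilde z$. $\tilde g^i_{x,y}(t, \cdot)$ is Lebesgue integrable in $\omega$, since

\[ \int \tilde g^i_{x,y}(t, \omega) \, d\omega = \mathbb{E} \, \tilde s(x + t e_i, y) = k(x + t e_i, y) < \infty. \]

For any $\omega \in \Omega$, $\frac{\partial}{\partial t} \tilde g^i_{x,y}(t, \omega)$ exists, and satisfies:

\begin{align*}
\mathbb{E}_\omega \left| \frac{\partial}{\partial t} \tilde g^i_{x,y}(t, \omega) \right|
&= \mathbb{E}_\omega \left| \frac{2}{D} \sum_{j=1}^{D/2} \omega_{ji} \sin(\omega_j^T y) \cos(\omega_j^T x + t \omega_{ji}) - \omega_{ji} \cos(\omega_j^T y) \sin(\omega_j^T x + t \omega_{ji}) \right| \\
&\le \mathbb{E}_\omega \left[ \frac{2}{D} \sum_{j=1}^{D/2} \left| \omega_{ji} \sin(\omega_j^T y) \cos(\omega_j^T x + t \omega_{ji}) \right| + \left| \omega_{ji} \cos(\omega_j^T y) \sin(\omega_j^T x + t \omega_{ji}) \right| \right] \\
&\le \mathbb{E}_\omega \left[ \frac{2}{D} \sum_{j=1}^{D/2} 2 \left| \omega_{ji} \right| \right] \\
&\le 2 \, \mathbb{E}_\omega \|\omega\|,
\end{align*}

which is finite since the first moment of $\omega$ is assumed to exist.

Thus we have $\frac{\partial}{\partial x_i} \mathbb{E} \, \tilde s(x, y) = \mathbb{E} \frac{\partial}{\partial x_i} \tilde s(x, y)$. The same holds for $y$ by symmetry. Combining the
results for each component, we get as desired that $\mathbb{E} \nabla_\Delta \tilde s(x, y) = \nabla_\Delta \mathbb{E} \, \tilde s(x, y)$.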


B.3.2  Lipschitz  Constant 

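Since $\tilde f$ is differentiable, $L_f = \|\nabla \tilde f(\Delta^*)\|$, where $\Delta^* = \arg\max_{\Delta \in \mathcal{X}_\Delta} \|\nabla \tilde f(\Delta)\|$.

Via Jensen’s inequality, $\mathbb{E} \|\nabla \tilde s(\Delta)\| \ge \|\mathbb{E} \nabla \tilde s(\Delta)\|$. Now, letting $\Delta^* = x^* - y^*$:

\begin{align*}
\mathbb{E}[L_f^2] &= \mathbb{E}\left[ \|\nabla \tilde s(\Delta^*) - \nabla k(\Delta^*)\|^2 \right] \\
&= \mathbb{E}_{\Delta^*}\left[ \mathbb{E}\left[ \|\nabla \tilde s(\Delta^*)\|^2 \right] - 2 \|\nabla k(\Delta^*)\| \, \mathbb{E}\left[ \|\nabla \tilde s(\Delta^*)\| \right] + \|\nabla k(\Delta^*)\|^2 \right] \\
&\le \mathbb{E}_{\Delta^*}\left[ \mathbb{E}\left[ \|\nabla \tilde s(\Delta^*)\|^2 \right] - 2 \|\nabla k(\Delta^*)\|^2 + \|\nabla k(\Delta^*)\|^2 \right] \\
&= \mathbb{E}\left[ \|\nabla \tilde s(\Delta^*)\|^2 \right] - \mathbb{E}_{\Delta^*}\left[ \|\nabla k(\Delta^*)\|^2 \right] \\
&\le \mathbb{E} \|\nabla \tilde s(\Delta^*)\|^2 \tag{B.2} \\
&= \mathbb{E} \|\nabla \tilde z(x^*)^T \tilde z(y^*)\|^2 \\
&= \mathbb{E} \left\| \nabla \frac{1}{D/2} \sum_{i=1}^{D/2} \tilde z_i(x^*)^T \tilde z_i(y^*) \right\|^2 \\
&= \mathbb{E} \|\nabla \tilde z_i(x^*)^T \tilde z_i(y^*)\|^2 \\
&= \mathbb{E} \|\nabla \cos(\omega^T \Delta^*)\|^2 \\
&= \mathbb{E} \|{-\sin(\omega^T \Delta^*)} \, \omega\|^2 \\
&= \mathbb{E} \sin^2(\omega^T \Delta^*) \|\omega\|^2 \\
&\le \mathbb{E} \|\omega\|^2 = \sigma_p^2.
\end{align*}

We can thus use Markov’s inequality:

\[ \Pr\left( L_f \ge \epsilon/(2r) \right) \le \sigma_p^2 (2r/\epsilon)^2 = 4 (\sigma_p r / \epsilon)^2. \]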
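
B.3.3 Anchor Points


For any fixed $\Delta = x - y$, $\tilde f(\Delta)$ is a mean of $D/2$ terms with mean 0 and bounded by $\pm 1$. Applying
Hoeffding’s inequality and a union bound:

\[
\Pr\left( \bigcup_{i=1}^T |\tilde f(\Delta_i)| \ge \tfrac12 \epsilon \right)
\le T \Pr\left( |\tilde f(\Delta)| \ge \tfrac12 \epsilon \right)
\le 2 T \exp\left( - \frac{2 \frac{D}{2} \left( \frac{\epsilon}{2} \right)^2}{(1 - (-1))^2} \right)
= 2 T \exp\left( - \frac{D \epsilon^2}{16} \right).
\]

Alternatively, since we know from (3.4) that the variance of each term is $\mathrm{Var}[\cos(\omega^T \Delta)] =
\tfrac12 + \tfrac12 k(2\Delta) - k(\Delta)^2$, we could use Bernstein’s inequality:

\begin{align*}
T \Pr\left( |\tilde f(\Delta)| \ge \tfrac12 \epsilon \right)
&\le 2 T \exp\left( - \frac{\frac{D}{2} \frac{\epsilon^2}{4}}{2 \, \mathrm{Var}[\cos(\omega^T \Delta)] + \frac{2}{3} \frac{\epsilon}{2}} \right) \\
&= 2 T \exp\left( - \frac{D \epsilon^2}{16 \, \mathrm{Var}[\cos(\omega^T \Delta)] + \frac{8}{3} \epsilon} \right). \tag{B.3}
\end{align*}

This is a better bound when $\mathrm{Var}[\cos(\omega^T \Delta)] + \tfrac16 \epsilon < 1$. For rbf kernels, $\mathrm{Var}[\cos(\omega^T \Delta)] \le \tfrac12$,
so the Bernstein bound is better for any $\epsilon < 3$. Since the maximal possible error is $\epsilon = 2$, it is
essentially always better for rbf kernels.

To unify the two bounds, let $\alpha_\epsilon := \min\left( 1, \max_{\Delta \in \mathcal{X}_\Delta} \tfrac12 + \tfrac12 k(2\Delta) - k(\Delta)^2 + \tfrac16 \epsilon \right)$. Then

\[
\Pr\left( \bigcup_{i=1}^T |\tilde f(\Delta_i)| \ge \tfrac12 \epsilon \right) \le 2 T \exp\left( - \frac{D \epsilon^2}{16 \alpha_\epsilon} \right).
\]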

B.3.4  Optimizing  Over  r 

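Combining these two bounds, we have a bound in terms of $r$:

\[
\Pr\left( \sup_{\Delta \in \mathcal{X}_\Delta} |\tilde f(\Delta)| \le \epsilon \right) \ge 1 - \kappa_1 r^{-d} - \kappa_2 r^2,
\]

letting $\kappa_1 = 2 (4\ell)^d \exp\left( - \frac{D \epsilon^2}{16 \alpha_\epsilon} \right)$ and $\kappa_2 = 4 \sigma_p^2 \epsilon^{-2}$.

If we choose $r = (\kappa_1 / \kappa_2)^{1/(d+2)}$, as did Rahimi and Recht (2007), the bound again becomes
$1 - 2 \kappa_1^{\frac{2}{d+2}} \kappa_2^{\frac{d}{d+2}}$. But we could instead maximize the bound by choosing $r$ such that $d \kappa_1 r^{-d-1} - 2 \kappa_2 r = 0$,
i.e. $r = \left( \frac{d \kappa_1}{2 \kappa_2} \right)^{\frac{1}{d+2}}$. Then the bound becomes $1 - \left( \left( \tfrac{d}{2} \right)^{\frac{-d}{d+2}} + \left( \tfrac{d}{2} \right)^{\frac{2}{d+2}} \right) \kappa_1^{\frac{2}{d+2}} \kappa_2^{\frac{d}{d+2}}$:

\begin{align*}
\Pr\left( \sup_{\Delta \in \mathcal{X}_\Delta} |\tilde f(\Delta)| > \epsilon \right)
&\le \left( \left( \tfrac{d}{2} \right)^{\frac{-d}{d+2}} + \left( \tfrac{d}{2} \right)^{\frac{2}{d+2}} \right)
     \left( 2 (4\ell)^d \exp\left( - \frac{D \epsilon^2}{16 \alpha_\epsilon} \right) \right)^{\frac{2}{d+2}}
     \left( 4 \sigma_p^2 \epsilon^{-2} \right)^{\frac{d}{d+2}} \\
&= \left( \left( \tfrac{d}{2} \right)^{\frac{-d}{d+2}} + \left( \tfrac{d}{2} \right)^{\frac{2}{d+2}} \right)
     2^{\frac{2 + 4d + 2d}{d+2}} \left( \frac{\sigma_p \ell}{\epsilon} \right)^{\frac{2d}{d+2}}
     \exp\left( - \frac{D \epsilon^2}{8 (d+2) \alpha_\epsilon} \right) \tag{B.4} \\
&= \left( \left( \tfrac{d}{2} \right)^{\frac{-d}{d+2}} + \left( \tfrac{d}{2} \right)^{\frac{2}{d+2}} \right)
     2^{\frac{6d + 2}{d+2}} \left( \frac{\sigma_p \ell}{\epsilon} \right)^{\frac{2}{1 + 2/d}}
     \exp\left( - \frac{D \epsilon^2}{8 (d+2) \alpha_\epsilon} \right). \tag{B.5}
\end{align*}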

For $\epsilon \le \sigma_p \ell$, we can loosen the exponent on the middle term to 2, though in low dimensions
we  have  a  somewhat  sharper  bound.  We  no  longer  need  the  l  >  1  assumption  of  the  original 
proof. 

To prove the final statement of Proposition 3.6, simply set (B.4) to be at most $\delta$ and solve for
$D$.
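
The effect of the two choices of radius can be checked numerically. The sketch below plugs illustrative values (all of the numbers here are assumptions chosen only so that the bound is non-trivial) into the failure probability $\kappa_1 r^{-d} + \kappa_2 r^2$ and compares the radius used by Rahimi and Recht with the stationary point derived above.

```python
import math

def failure_bound(r, kappa1, kappa2, d):
    """kappa1 r^(-d) + kappa2 r^2, the failure probability as a function of r."""
    return kappa1 * r ** (-d) + kappa2 * r ** 2

def rahimi_recht_r(kappa1, kappa2, d):
    return (kappa1 / kappa2) ** (1 / (d + 2))

def optimal_r(kappa1, kappa2, d):
    """Stationary point of failure_bound: d kappa1 r^(-d-1) = 2 kappa2 r."""
    return (d * kappa1 / (2 * kappa2)) ** (1 / (d + 2))

# Illustrative values only.
d, ell, sigma_p, eps, D, alpha = 10, 1.0, 1.0, 0.1, 50_000, 0.5
kappa1 = 2 * (4 * ell) ** d * math.exp(-D * eps ** 2 / (16 * alpha))
kappa2 = 4 * sigma_p ** 2 / eps ** 2

rr = failure_bound(rahimi_recht_r(kappa1, kappa2, d), kappa1, kappa2, d)
opt = failure_bound(optimal_r(kappa1, kappa2, d), kappa1, kappa2, d)

common = kappa1 ** (2 / (d + 2)) * kappa2 ** (d / (d + 2))
closed_form = ((d / 2) ** (-d / (d + 2)) + (d / 2) ** (2 / (d + 2))) * common
```

The two choices coincide when $d = 2$; for other dimensions the leading coefficient of the optimized bound is strictly smaller than 2.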


B.4  Proof  of  Proposition  3.7 


We  will  follow  the  proof  strategy  of  Proposition  3.6  as  closely  as  possible. 

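Our approximation is now $\breve s(x, y) = \breve z(x)^T \breve z(y)$, and the error is $\breve f(x, y) = \breve s(x, y) - k(y, x)$. Note
that $\breve s$ and $\breve f$ are not shift-invariant: for example, with $D = 1$, $\breve s(x, y) = \cos(\omega^T \Delta) + \cos(\omega^T (x +
y) + 2b)$ but $\breve s(\Delta, 0) = \cos(\omega^T \Delta) + \cos(\omega^T \Delta + 2b)$.

Let $q = (x, y) \in \mathcal{X}^2$ denote the argument to these functions. $\mathcal{X}^2$ is a compact set in $\mathbb{R}^{2d}$ with
diameter $\sqrt{2} \ell$, so we can cover it with an $\epsilon$-net using at most $T = \left( 2 \sqrt{2} \ell / r \right)^{2d}$ balls of radius $r$.
Let $\{q_i\}_{i=1}^T$ denote their centers, and $L_f$ be the Lipschitz constant of $\breve f : \mathbb{R}^{2d} \to \mathbb{R}$.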


B.4.1  Regularity  Condition 

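To show $\mathbb{E} \nabla \breve s(q) = \nabla \mathbb{E} \, \breve s(q)$, we can define $\breve g^i_{x,y}(t, \omega)$ analogously to in Appendix B.3.1, where
here $\omega$ contains all the $\omega_i$ and $b_i$ variables used in $\breve z$. We then have:

\begin{align*}
\mathbb{E}_\omega \left| \frac{\partial \breve g^i_{x,y}(t, \omega)}{\partial t} \right|
&= \mathbb{E}_\omega \left| \frac{2}{D} \sum_{j=1}^{D} - \omega_{ji} \cos(\omega_j^T y + b_j) \sin(\omega_j^T x + t \omega_{ji} + b_j) \right| \\
&\le \mathbb{E}_\omega \left[ \frac{2}{D} \sum_{j=1}^{D} |\omega_{ji}| \right] \\
&\le 2 \, \mathbb{E}_\omega \|\omega\|,
\end{align*}

which we have assumed to be finite.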


B.4.2  Lipschitz  Constant 

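The argument follows that of Appendix B.3.2 up to (B.2), using $q^*$ in place of $\Delta^*$. Then:

\begin{align*}
\mathbb{E}[L_f^2] &\le \mathbb{E} \|\nabla \breve s(q^*)\|^2 \\
&= \mathbb{E} \left\| \nabla_q \left( 2 \cos(\omega^T x^* + b) \cos(\omega^T y^* + b) \right) \right\|^2 \\
&= \mathbb{E} \left[ \left\| \nabla_x \left( 2 \cos(\omega^T x^* + b) \cos(\omega^T y^* + b) \right) \right\|^2
   + \left\| \nabla_y \left( 2 \cos(\omega^T x^* + b) \cos(\omega^T y^* + b) \right) \right\|^2 \right] \\
&= \mathbb{E} \left[ \left\| -2 \sin(\omega^T x^* + b) \cos(\omega^T y^* + b) \, \omega \right\|^2
   + \left\| -2 \cos(\omega^T x^* + b) \sin(\omega^T y^* + b) \, \omega \right\|^2 \right] \\
&= \mathbb{E} \left[ 4 \left( \sin^2(\omega^T x^* + b) \cos^2(\omega^T y^* + b) + \cos^2(\omega^T x^* + b) \sin^2(\omega^T y^* + b) \right) \|\omega\|^2 \right] \\
&= \mathbb{E}_\omega \left[ \mathbb{E}_b \left[ 2 - \cos(2 \omega^T (x^* - y^*)) - \cos(2 \omega^T (x^* + y^*) + 4b) \right] \|\omega\|^2 \right] \\
&= \mathbb{E}_\omega \left[ \left( 2 - \cos(2 \omega^T (x^* - y^*)) \right) \|\omega\|^2 \right] \\
&\le 3 \, \mathbb{E} \|\omega\|^2 = 3 \sigma_p^2.
\end{align*}

Following through with Markov’s inequality:

\[ \Pr\left( L_f \ge \epsilon/(2r) \right) \le 3 \sigma_p^2 (2r/\epsilon)^2 = 12 (\sigma_p r / \epsilon)^2. \]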

B.4.3  Anchor  Points 

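For any fixed $x$, $y$, $\breve s$ takes a mean of $D$ terms with expectation $k(x, y)$ bounded by $\pm 2$. Using
Hoeffding’s inequality:

\[
\Pr\left( \bigcup_{i=1}^T |\breve f(q_i)| \ge \tfrac12 \epsilon \right)
\le T \Pr\left( |\breve f(q)| \ge \tfrac12 \epsilon \right)
\le 2 T \exp\left( - \frac{2 D \left( \frac{\epsilon}{2} \right)^2}{(2 - (-2))^2} \right)
= 2 T \exp\left( - \frac{D \epsilon^2}{32} \right).
\]

Since the variance of each term is given by (3.6) as $\mathrm{Var}[\cos(\omega^T \Delta)] + \tfrac12$, we can instead use
Bernstein’s inequality:

\begin{align*}
T \Pr\left( |\breve f(q)| \ge \tfrac12 \epsilon \right)
&\le 2 T \exp\left( - \frac{D \frac{\epsilon^2}{4}}{2 \left( \mathrm{Var}[\cos(\omega^T \Delta)] + \frac12 \right) + \frac{2}{3} \epsilon} \right) \\
&= 2 T \exp\left( - \frac{D \epsilon^2}{4 + 8 \, \mathrm{Var}[\cos(\omega^T \Delta)] + \frac{8}{3} \epsilon} \right). \tag{B.6}
\end{align*}

Thus Bernstein’s gives us a tighter bound if

\[
4 + 8 \, \mathrm{Var}[\cos(\omega^T \Delta)] + \tfrac{8}{3} \epsilon < 32,
\quad \text{i.e.} \quad
2 \, \mathrm{Var}[\cos(\omega^T \Delta)] + \tfrac{2}{3} \epsilon < 7,
\]

which, since $\mathrm{Var}[\cos(\omega^T \Delta)] \le 1$, means the Bernstein bound is better for any $\epsilon < 7.5$ no matter
the kernel.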

Still,  it  can  be  preferable  to  have  a  bound  independent  of  e,  so  to  unify  the  bounds  define 
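$\alpha'_\epsilon := \min\left( 1, \max_\Delta \tfrac18 + \tfrac14 \mathrm{Var}[\cos(\omega^T \Delta)] + \tfrac{1}{12} \epsilon \right)$; then

\[
\Pr\left( \bigcup_{i=1}^T |\breve f(q_i)| \ge \tfrac12 \epsilon \right) \le 2 T \exp\left( - \frac{D \epsilon^2}{32 \alpha'_\epsilon} \right).
\]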
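
B.4.4 Optimizing Over r

Our bound is now of the form

\[
\Pr\left( \sup_{q \in \mathcal{X}^2} |\breve f(q)| \le \epsilon \right) \ge 1 - \kappa_1 r^{-2d} - \kappa_2 r^2,
\]

with $\kappa_1 = 2 \left( 2 \sqrt{2} \ell \right)^{2d} \exp\left( - \frac{D \epsilon^2}{32 \alpha'_\epsilon} \right)$ and $\kappa_2 = 12 \sigma_p^2 \epsilon^{-2}$.

This is maximized by $r$ when $2 d \kappa_1 r^{-2d-1} - 2 \kappa_2 r = 0$, i.e. $r = \left( \frac{d \kappa_1}{\kappa_2} \right)^{\frac{1}{2d+2}}$. Substituting that
value of $r$ into the bound yields $1 - \left( d^{\frac{-d}{d+1}} + d^{\frac{1}{d+1}} \right) \kappa_1^{\frac{1}{d+1}} \kappa_2^{\frac{d}{d+1}}$, and thus:

\begin{align*}
\Pr\left( \sup_{q \in \mathcal{X}^2} |\breve f(q)| > \epsilon \right)
&\le \left( d^{\frac{-d}{d+1}} + d^{\frac{1}{d+1}} \right)
     \left( 2 \left( 2 \sqrt{2} \ell \right)^{2d} \exp\left( - \frac{D \epsilon^2}{32 \alpha'_\epsilon} \right) \right)^{\frac{1}{d+1}}
     \left( 12 \sigma_p^2 \epsilon^{-2} \right)^{\frac{d}{d+1}} \\
&= \left( d^{\frac{-d}{d+1}} + d^{\frac{1}{d+1}} \right)
     2^{\frac{1 + 2d + d + 2d}{d+1}} \, 3^{\frac{d}{d+1}} \left( \frac{\sigma_p \ell}{\epsilon} \right)^{\frac{2d}{d+1}}
     \exp\left( - \frac{D \epsilon^2}{32 (d+1) \alpha'_\epsilon} \right) \\
&= \left( d^{\frac{-d}{d+1}} + d^{\frac{1}{d+1}} \right)
     2^{\frac{5d + 1}{d+1}} \, 3^{\frac{d}{d+1}} \left( \frac{\sigma_p \ell}{\epsilon} \right)^{\frac{2}{1 + 1/d}}
     \exp\left( - \frac{D \epsilon^2}{32 (d+1) \alpha'_\epsilon} \right). \tag{B.7}
\end{align*}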

As before, when $\epsilon \le \sigma_p \ell$ we can loosen the exponent on the middle term to 2; it is slightly
worse than the corresponding exponent of (B.5) for small $d$.

To prove the final statement of Proposition 3.7, set (B.7) to be at most $\delta$ and solve for $D$.


B.5  Proof  of  Proposition  3.8 


Consider the $\tilde z$ features, and recall that we supposed $k$ is $L$-Lipschitz over $\mathcal{X}_\Delta := \{x - y \mid x, y \in \mathcal{X}\}$.

Our primary tool will be the following slight generalization of Dudley’s entropy integral,
which is a special case of Lemma 13.1 of Boucheron et al. (2013). (The only difference from
their Corollary 13.2 is that we maintain the variance factor $v$.)

Theorem (Boucheron et al. 2013). Let $T$ be a finite pseudometric space and let $(X_t)_{t \in T}$ be a
collection of random variables such that for some constant $v > 0$,

\[ \log \mathbb{E} \, e^{\lambda (X_t - X_{t'})} \le \tfrac12 v \lambda^2 d^2(t, t') \]

for all $t, t' \in T$ and all $\lambda > 0$. For any $t_0 \in T$, let $\delta = \sup_{t \in T} d(t, t_0)$; then

\[ \mathbb{E}\left[ \sup_{t \in T} X_t - X_{t_0} \right] \le 12 \sqrt{v} \int_0^{\delta/2} \sqrt{H(u, T)} \, du, \]

where $H(u, T)$ is the metric entropy of $T$, that is, the logarithm of the $u$-packing number $N(u, T)$.

Note that, although stated for finite pseudometric spaces, the result is extensible to separable
pseudometric spaces (such as $\mathcal{X}_\Delta$) by standard arguments.

The $\delta$-packing number is at most the $\frac{\delta}{2}$-covering number, which Proposition 5 of Cucker and
Smale (2001) bounds. Thus, picking $\Delta_0 = 0$ gives $\delta = \ell$, $H(\delta, \mathcal{X}_\Delta) \le d \log(8\ell/\delta)$, and

\[
\int_0^{\ell/2} \sqrt{H(u, \mathcal{X}_\Delta)} \, du \le \int_0^{\ell/2} \sqrt{d \log(8\ell/u)} \, du = \gamma \ell \sqrt{d},
\]

where $\gamma := 4 \sqrt{\pi} \, \mathrm{erfc}\left( 2 \sqrt{\log 2} \right) + \sqrt{\log 2} \approx 0.964$.
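
The closed form for the constant can be sanity-checked numerically. The sketch below evaluates the erfc expression and compares it with a simple midpoint-rule approximation of the entropy integral after rescaling so that the diameter bound is 1 (the rescaling, the function name, and the grid size are choices made for this illustration).

```python
import math

# Closed form: 4 sqrt(pi) erfc(2 sqrt(log 2)) + sqrt(log 2).
gamma = (4 * math.sqrt(math.pi) * math.erfc(2 * math.sqrt(math.log(2)))
         + math.sqrt(math.log(2)))

def entropy_integral(n=200_000):
    """Midpoint rule for the integral of sqrt(log(8 / v)) over v in (0, 1/2).

    This is the entropy integral divided by ell * sqrt(d), obtained by
    substituting u = ell * v.
    """
    h = 0.5 / n
    return h * sum(math.sqrt(math.log(8 / ((i + 0.5) * h))) for i in range(n))

numeric = entropy_integral()
```

The integrand grows only like the square root of a logarithm near zero, so the midpoint rule converges without any special treatment of the endpoint.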

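Now, $\frac{2}{D} \left( \cos(\omega_i^T \Delta) - k(\Delta) - \cos(\omega_i^T \Delta') + k(\Delta') \right)$ has mean zero, and absolute value at most

\begin{align*}
\frac{2}{D} \left| \cos(\omega_i^T \Delta) - k(\Delta) - \cos(\omega_i^T \Delta') + k(\Delta') \right|
&\le \frac{2}{D} \left( \left| \cos(\omega_i^T \Delta) - \cos(\omega_i^T \Delta') \right| + \left| k(\Delta) - k(\Delta') \right| \right) \\
&\le \frac{2}{D} \left( \left| \omega_i^T \Delta - \omega_i^T \Delta' \right| + L \|\Delta - \Delta'\| \right) \\
&\le \frac{2}{D} \left( \|\omega_i\| + L \right) \|\Delta - \Delta'\|. \tag{B.8}
\end{align*}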

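Thus, via Hoeffding’s lemma (Boucheron et al. 2013, Lemma 2.2), each such term has log
moment generating function at most $\frac{1}{D^2} (\|\omega_i\| + L)^2 \lambda^2 \|\Delta - \Delta'\|^2$.

This is almost in the form required by Dudley’s entropy integral, except that $\omega_i$ is a random
variable. Thus, for any $r > 0$, define the random process $\tilde g_r$ which is distributed as $\tilde f$ except we
require that $\|\omega_1\| = r$ and $\|\omega_i\| \le r$ for all $i > 1$. Since log mgfs of independent variables are
additive, we thus have

\[
\log \mathbb{E} \, e^{\lambda (\tilde g_r(\Delta) - \tilde g_r(\Delta'))}
\le \left( \frac{1}{D^2} \sum_{i=1}^{D/2} (\|\omega_i\| + L)^2 \right) \lambda^2 \|\Delta - \Delta'\|^2
\le \frac{1}{2D} (r + L)^2 \lambda^2 \|\Delta - \Delta'\|^2.
\]

$\tilde g_r$ satisfies the conditions of the theorem with $v = \frac{1}{D} (r + L)^2$. Now, $\tilde g_r(0) = 0$, so we have

\[
\mathbb{E} \sup_{\Delta \in \mathcal{X}_\Delta} \tilde g_r(\Delta) \le \frac{12 \gamma \sqrt{d} \, \ell}{\sqrt{D}} (r + L).
\]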

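But the distribution of $\tilde f$ conditioned on the event $\max_i \|\omega_i\| = r$ is the same as the distribution
of $\tilde g_r$. Thus

\[
\mathbb{E} \sup \tilde f = \mathbb{E}_r \left[ \mathbb{E}[\sup \tilde g_r] \right]
\le \mathbb{E}_r \left[ \frac{12 \gamma \sqrt{d} \, \ell}{\sqrt{D}} (r + L) \right]
= \frac{12 \gamma \sqrt{d} \, \ell}{\sqrt{D}} (R + L),
\]

where $R := \mathbb{E} \max_{i=1}^{D/2} \|\omega_i\|$.

The same holds for $\mathbb{E} \sup(-\tilde f)$. Since we have $\sup \tilde f \ge 0$, $\sup(-\tilde f) \ge 0$, the claim follows
from $\mathbb{E}\left[ \max(\sup \tilde f, \sup(-\tilde f)) \right] \le \mathbb{E}\left[ \sup \tilde f + \sup(-\tilde f) \right]$.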

B.6 Proof of Proposition 3.9


For the $\breve z$ features, the error process again must be defined over $\mathcal{X}^2$ due to the non-shift invariant
noise. We still assume that $k$ is $L$-Lipschitz over $\mathcal{X}_\Delta$, however.

Compared to the argument of Appendix B.5, we have $H(u, \mathcal{X}^2) \le 2d \log\left( 4 \sqrt{2} \ell / u \right)$. Unlike
$\mathcal{X}_\Delta$, however, $\mathcal{X}^2$ does not necessarily contain an obvious point $q_0$ to minimize $\sup_{q \in \mathcal{X}^2} d(q, q_0)$,
nor an obvious minimal value. We rather consider the “radius” $\rho := \sup_{x \in \mathcal{X}} d(x, x_0)$, achieved
by any convenient point $x_0$; then $\sup_{q \in \mathcal{X}^2} d(q, (x_0, x_0)) = \sqrt{2} \rho$. Note that $\frac{\ell}{2} \le \rho \le \ell$, where the
lower bound is achieved by $\mathcal{X}$ a ball, and the upper bound by $\mathcal{X}$ a sphere. The integral in the
bound is then


<  [Pl  2  J 2d  \og(4y/2£/u) 

Jo 

=  4yfnd£  erfc  ^  log  2  +  log  4-V2^  j  +  pVd^j-  log  2  +  log  ^ 

=  ^V^rerfc  log  2  +  log  4V2^J  +  £  log  2  +  log  Wd. 


(B.9) 


Calling  the  term  in  parentheses  y'^  ,  we  have  that  y'  ~  1.541,  y'  «  0.803,  and  it  decreases 
monotonically  in  between,  as  shown  in  Figure  B.l. 


Figure B.1: The coefficient of (B.9) as a function of $\rho$.


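We will again use the notation of $q = (x, y) \in \mathcal{X}^2$, $\Delta = x - y$, $t = x + y$. Each term in the sum
of $\breve f(q) - \breve f(q')$ has mean zero and absolute value at most

\begin{align*}
&\frac{1}{D} \left| \cos(\omega_i^T \Delta) + \cos(\omega_i^T t + 2 b_i) - k(\Delta) - \cos(\omega_i^T \Delta') - \cos(\omega_i^T t' + 2 b_i) + k(\Delta') \right| \\
&\le \frac{1}{D} \left( \left| \cos(\omega_i^T \Delta) - \cos(\omega_i^T \Delta') \right| + \left| \cos(\omega_i^T t + 2 b_i) - \cos(\omega_i^T t' + 2 b_i) \right| + \left| k(\Delta) - k(\Delta') \right| \right) \\
&\le \frac{1}{D} \left( \|\omega_i\| \|\Delta - \Delta'\| + \|\omega_i\| \|t - t'\| + L \|\Delta - \Delta'\| \right).
\end{align*}

Now, in order to cast this in terms of distance on $\mathcal{X}^2$, let $\delta_x = x - x'$, $\delta_y = y - y'$. Then

\begin{align*}
\|q - q'\|^2 &= \|\delta_x\|^2 + \|\delta_y\|^2 \\
\left( \|\Delta - \Delta'\| + \|t - t'\| \right)^2
&= \left( \sqrt{\|\delta_x\|^2 + \|\delta_y\|^2 - 2 \delta_x^T \delta_y} + \sqrt{\|\delta_x\|^2 + \|\delta_y\|^2 + 2 \delta_x^T \delta_y} \right)^2 \\
&= 2 \|\delta_x\|^2 + 2 \|\delta_y\|^2 + 2 \sqrt{\left( \|\delta_x\|^2 + \|\delta_y\|^2 \right)^2 - 4 (\delta_x^T \delta_y)^2} \\
&\le 4 \left( \|\delta_x\|^2 + \|\delta_y\|^2 \right) \\
\|\Delta - \Delta'\| + \|t - t'\| &\le 2 \|q - q'\| \\
\|\Delta - \Delta'\| &\le 2 \|q - q'\|
\end{align*}

and so each term in the sum of $\breve f(q) - \breve f(q')$ has absolute value at most $\frac{2}{D} (\|\omega_i\| + L) \|q - q'\|$.
Note that this agrees exactly with (B.8), but the sum in $\breve f(q) - \breve f(q')$ has $D$ terms rather than $D/2$.
Defining $\breve g_r$ analogously to $\tilde g_r$, we thus get that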


,  d 


logEe 


-t(lr(?)-|r(?')) 


< 


D 


1  y  /II  .  .  II  .  r  \2  I  ,2,1  _  „,||2  2 

n  /-i 


kill  +  Lf  |  >Y\\c,  -  ,j'||2  <  ^(r  +  L)242||?  -  <,'||2, 


\D  (=. 


and  the  conditions  of  the  theorem  hold  with  v  =  ^(r  +  L)-.  Note  that  ~Egr(qo)  =  0.  Carrying  out 
the  rest  of  the  argument,  we  get  that 


-  sup  /  =  Er[E[sup  gr\\  < 


24  /3{ip£'\[d 

Vd 


0 r  +  L ) 


24  pc/piyfd 

Vd 


(R  +  L), 


and  similarly  for  E  sup  /.  We  do  not  have  a  guarantee  that  f(q )  does  not  have  a  consistent  sign, 
and  so  our  bound  becomes 


EH/lloo  <E 


< 


Vd 


/  crosses  0]  Pr  crosses  oj  +  3  Pr  does  not  cross  oj 

)■ 


+  L)  Pr  (/  crosses  oj  +  3  Pr  ( /  does  not  cross  0 


$\Pr(\breve f \text{ crosses } 0)$ is extremely close to 1 in “usual” situations.

Appendix C

Proofs for Chapter 4

C.1 Proof of Proposition 4.10

We  will  now  prove  the  bound  on  the  error  probability  of  our  embedding 

Pr  (lA'Cp,  q)  -  z(A(p))J z(A(q))\  >  ej 


for  fixed  densities  p  and  q. 


Setup  We  will  need  a  few  assumptions  on  the  densities: 

1. $p$ and $q$ are bounded above and below: for $x \in [0, 1]^d$, $0 < \rho_* \le p(x), q(x) \le \rho^* < \infty$.

2. $p, q \in \Sigma(\beta, L_\beta)$ for some $\beta, L_\beta > 0$. $\Sigma(\beta, L)$ refers to the Hölder class of functions $f$ whose
partial derivatives up to order $\lfloor \beta \rfloor$ are continuous and whose $r$th partial derivatives, where
$r$ is a multi-index of order $\lfloor \beta \rfloor$, satisfy $|D^r f(x) - D^r f(y)| \le L \|x - y\|^{\beta - |r|}$. Here $\lfloor \beta \rfloor$ is the
greatest integer strictly less than $\beta$.

3.  p,  q  are  periodic. 

These  are  fairly  standard  smoothness  assumptions  in  the  nonparametric  estimation  literature. 

Let $\gamma = \min(\beta, 1)$. If $\beta > 1$, then $p, q \in \Sigma(1, L_\gamma)$ for some $L_\gamma$; otherwise, clearly $p, q \in
\Sigma(\beta, L_\beta)$. Then, from assumption 3, $p, q \in \Sigma_{per}(\gamma, L_\gamma)$, the periodic Hölder class. We’ll need this
to establish the Sobolev ellipsoid containing $p$ and $q$.

We will use kernel density estimation with a bounded, continuous kernel so that the bound
of Giné and Guillou (2002) applies, with bandwidth $h \asymp n^{-1/(2\beta + d)} \log n$, and truncating density
estimates to $[\rho_*, \rho^*]$.

We also use the Fourier basis $\varphi_\alpha(x) = \exp(2 i \pi \alpha^T x)$, and define $V$ as the set of indices $\alpha$ s.t.
$\sum_{j=1}^d |\alpha_j|^{2s} \le t$ for parameters $0 < s < 1$, $t > 0$ to be discussed later.
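
To make the index set concrete, the following minimal Python sketch enumerates $V$ by brute force for small $d$. The function name `index_set` and the box-search strategy are illustrative assumptions, not part of the construction above; the number of candidates grows exponentially in $d$, so this is only meant for building intuition.

```python
from itertools import product


def index_set(d, s, t):
    """Enumerate V = {alpha in Z^d : sum_j |alpha_j|^(2s) <= t} by brute force."""
    if t < 0:
        return []
    # Any single coordinate must satisfy |alpha_j|^(2s) <= t on its own,
    # so it suffices to search the box [-bound, bound]^d.
    bound = int(t ** (1.0 / (2 * s)))
    # Guard against floating-point rounding just below an integer.
    while (bound + 1) ** (2 * s) <= t:
        bound += 1
    coords = range(-bound, bound + 1)
    return [
        alpha
        for alpha in product(coords, repeat=d)
        if sum(abs(a) ** (2 * s) for a in alpha) <= t
    ]
```

For example, with $d = 2$, $s = 1/2$ and $t = 2$ the condition reads $|\alpha_1| + |\alpha_2| \le 2$, and the sketch returns 13 indices.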


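Decomposition  Let $r_\sigma(\Delta) = \exp(-\Delta^2 / (2 \sigma^2))$. Then

\begin{align*}
\left| K(p, q) - z(\hat A(\hat p))^T z(\hat A(\hat q)) \right| \le {}
 & \left| K(p, q) - r_\sigma\left( \|\hat A(\hat p) - \hat A(\hat q)\| \right) \right| \\
 & + \left| r_\sigma\left( \|\hat A(\hat p) - \hat A(\hat q)\| \right) - z(\hat A(\hat p))^T z(\hat A(\hat q)) \right| .
\end{align*}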
The latter term was bounded in Chapter 3. For the former, note that $r_\sigma$ is $\frac{1}{\sigma \sqrt{e}}$-Lipschitz, so the
first term is at most $\frac{1}{\sigma \sqrt{e}} \left| \rho(p, q) - \|\hat A(\hat p) - \hat A(\hat q)\| \right|$. Breaking this up with the triangle inequality:

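\begin{align*}
\left| \rho(p, q) - \|\hat A(\hat p) - \hat A(\hat q)\| \right| \le {}
 & \left| \rho(p, q) - \rho(\hat p, \hat q) \right|
   + \left| \rho(\hat p, \hat q) - \|\psi(\hat p) - \psi(\hat q)\| \right| \\
 & + \left| \|\psi(\hat p) - \psi(\hat q)\| - \|A(\hat p) - A(\hat q)\| \right|
   + \left| \|A(\hat p) - A(\hat q)\| - \|\hat A(\hat p) - \hat A(\hat q)\| \right| . \tag{C.1}
\end{align*}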
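
The Lipschitz constant of the Gaussian profile $r_\sigma(\Delta) = \exp(-\Delta^2 / (2\sigma^2))$ used in this step can be checked by maximizing the magnitude of its derivative. This is a short verification sketch, not part of the original argument:

```latex
\begin{align*}
r_\sigma'(\Delta) &= -\frac{\Delta}{\sigma^2} \exp\left( -\frac{\Delta^2}{2 \sigma^2} \right), \\
\frac{\mathrm{d}}{\mathrm{d}\Delta} \left[ \Delta \exp\left( -\frac{\Delta^2}{2 \sigma^2} \right) \right]
  &= \left( 1 - \frac{\Delta^2}{\sigma^2} \right) \exp\left( -\frac{\Delta^2}{2 \sigma^2} \right) = 0
  \iff \Delta = \sigma \quad (\Delta \ge 0), \\
\sup_{\Delta \ge 0} |r_\sigma'(\Delta)| &= |r_\sigma'(\sigma)| = \frac{1}{\sigma} e^{-1/2} = \frac{1}{\sigma \sqrt{e}} .
\end{align*}
```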


Estimation error  Recall that $\rho$ is a metric, so the reverse triangle inequality allows us to address
the first term with

\[ |\rho(p, q) - \rho(\hat p, \hat q)| \le \rho(p, \hat p) + \rho(q, \hat q). \]

For $\rho^2$ the total variation, squared Hellinger, or Jensen-Shannon HDDs, we have that $\rho^2(p, \hat p) \le
\mathrm{TV}(p, \hat p)$ (J. Lin 1991). Moreover, as the distributions are supported on $[0, 1]^d$, $\mathrm{TV}(p, \hat p) =
\frac{1}{2} \|p - \hat p\|_1 \le \frac{1}{2} \|p - \hat p\|_\infty$.

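It is a consequence of Giné and Guillou (2002) that, for any $\delta > 0$,

\[ \Pr\left( \|p - \hat p\|_\infty > \frac{\sqrt{C_\delta \log n}}{n^{\beta / (2\beta + d)}} \right) < \delta \]

for some $C_\delta$ depending on the kernel. Thus

\[ \Pr\left( |\rho(p, q) - \rho(\hat p, \hat q)| \ge \varepsilon \right) < 2\, C^{-1}\left( \frac{\varepsilon^4 n^{2\beta / (2\beta + d)}}{4 \log n} \right), \]

where $C_{C^{-1}(x)} = x$.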


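$\lambda$ approximation  The second term of (C.1), the approximation error due to sampling $\lambda$s,
admits a simple Hoeffding bound. Note that $\|\hat p^R_\lambda - \hat q^R_\lambda\|^2 + \|\hat p^I_\lambda - \hat q^I_\lambda\|^2$, viewed as a random
variable in $\lambda$ only, has expectation $\rho^2(\hat p, \hat q)$ and is bounded by $[0, 4Z]$ (where $Z = \int_{\mathbb{R}_{\ge 0}} \mathrm{d}\mu(\lambda)$):
write it as $Z \int \left| \hat p(x)^{\frac12 + i\lambda} - \hat q(x)^{\frac12 + i\lambda} \right|^2 \mathrm{d}x$, expand the square, and use $\int \sqrt{\hat p(x) \hat q(x)}\, \mathrm{d}x \le 1$ (via
Cauchy-Schwarz).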

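For nonnegative random variables $X$ and $Y$, $|X^2 - Y^2| = |X - Y| (X + Y) \ge |X - Y|^2$, so
$\Pr(|X - Y| \ge \varepsilon) \le \Pr(|X^2 - Y^2| \ge \varepsilon^2)$. Applying Hoeffding's inequality to the squared quantity, we
have that $\Pr\left( \left| \|\psi(\hat p) - \psi(\hat q)\| - \rho(\hat p, \hat q) \right| \ge \varepsilon \right)$ is at most $2 \exp(-M \varepsilon^4 / (8 Z^2))$.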
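
As a rough numerical illustration of this Hoeffding term, the sketch below evaluates $2 \exp(-M \varepsilon^4 / (8 Z^2))$ and inverts it to find how many sampled $\lambda$ values $M$ are needed to push the term below a target level. The function names and the example values of $\varepsilon$, $Z$ and the target level are assumptions chosen only for illustration.

```python
import math


def lambda_term_bound(M, eps, Z):
    """Value of the Hoeffding term 2 * exp(-M * eps^4 / (8 * Z^2)), capped at 1."""
    return min(1.0, 2.0 * math.exp(-M * eps ** 4 / (8.0 * Z ** 2)))


def min_num_lambdas(eps, delta, Z):
    """Smallest integer M for which the Hoeffding term is at most delta."""
    return math.ceil(8.0 * Z ** 2 * math.log(2.0 / delta) / eps ** 4)


if __name__ == "__main__":
    # Hypothetical values: eps = 0.5, Z = 1, target level 0.05.
    M = min_num_lambdas(0.5, 0.05, 1.0)
    print(M, lambda_term_bound(M, 0.5, 1.0))
```

The $\varepsilon^{-4}$ dependence is the price of passing from squared distances to distances: halving $\varepsilon$ multiplies the required $M$ by 16.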


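Tail truncation error  The third term of (C.1), the error due to truncating the tail projection
coefficients of the $\hat p^S_\lambda$ functions, requires a little more machinery. First note that
$\left| \|\psi(\hat p) - \psi(\hat q)\|^2 - \|A(\hat p) - A(\hat q)\|^2 \right|$ is at most

\[ \sum_{j=1}^{M} \sum_{S = R, I} \sum_{\alpha \notin V} \left| a_\alpha\left( \hat p^S_{\lambda_j} - \hat q^S_{\lambda_j} \right) \right|^2 . \]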


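Let $\mathcal{W}(s, L)$ be the Sobolev ellipsoid of functions $\sum_{\alpha \in \mathbb{Z}^d} a_\alpha \varphi_\alpha$ such that
$\sum_{\alpha \in \mathbb{Z}^d} \left( \sum_{j=1}^d |\alpha_j|^{2s} \right) |a_\alpha|^2 \le L$, where $\varphi$ is still the Fourier basis. Then Lemma 14 of
Krishnamurthy et al. (2014) shows that $\Sigma_{per}(\gamma, L_\gamma) \subseteq \mathcal{W}(s, L')$ for any $0 < s < \gamma$ and

\[ L' = d L_\gamma^2 (2\pi)^{-2 \lfloor \gamma \rfloor} \frac{4^\gamma}{4^\gamma - 4^s} . \]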
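So, suppose that $\hat p, \hat q \in \Sigma_{per}(\hat\gamma, \hat L)$ with probability at least $1 - \delta$. Since $x \mapsto x^{\frac12 + i\lambda}$ is
$\frac{\sqrt{1 + 4\lambda^2}}{2 \sqrt{\rho_*}}$-Lipschitz on $[\rho_*, \infty)$, $\hat p^S_\lambda \in \Sigma_{per}\left( \hat\gamma, \frac12 \sqrt{1 + 4\lambda^2}\, \hat L\, \rho_*^{-1/2} \right)$ and so
$\hat p^S_\lambda - \hat q^S_\lambda$ is in $\mathcal{W}(s, (1 + 4\lambda^2) L')$ for
$s < \hat\gamma$ and $L' = d \hat L^2 \rho_*^{-1} / (1 - 4^{s - \hat\gamma})$.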

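Recall that we chose $V$ to be the set of $\alpha \in \mathbb{Z}^d$ such that $\sum_{j=1}^d |\alpha_j|^{2s} \le t$. Thus

\[
\sum_{\alpha \notin V} \left| a_\alpha\left( \hat p^S_\lambda - \hat q^S_\lambda \right) \right|^2
< \sum_{\alpha \notin V} \left| a_\alpha\left( \hat p^S_\lambda - \hat q^S_\lambda \right) \right|^2 \frac{\sum_{j=1}^d |\alpha_j|^{2s}}{t}
\le (1 + 4\lambda^2) L' / t .
\]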


The tail error term is therefore at least $\varepsilon$ with probability no more than

\[ \delta + 2 \sum_{j=1}^{M} \Pr\left( (1 + 4\lambda_j^2) L' / t > \varepsilon^2 / (2M) \right) . \]

The latter probability, of course, depends on the choice of HDD $\rho$. Letting $\zeta = t \varepsilon^2 / (8 M L') - \frac14$, it
is 1 if $\zeta < 0$ and $1 - \mu([0, \sqrt{\zeta}]) / Z$ otherwise. If $\zeta \ge 0$, squared Hellinger’s probability is 0, and
total variation’s is $\frac{2}{\pi} \arctan(\sqrt{\zeta})$. A closed form for the cumulative distribution function for the
Jensen-Shannon measure is unfortunately unknown.
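
The threshold in the preceding paragraph, written $\zeta$ here, follows from solving the event inside the probability for $\lambda_j$. A short derivation using only the quantities already defined:

```latex
\begin{align*}
\frac{(1 + 4\lambda_j^2) L'}{t} > \frac{\varepsilon^2}{2M}
  &\iff 1 + 4\lambda_j^2 > \frac{t \varepsilon^2}{2 M L'} \\
  &\iff \lambda_j^2 > \frac{t \varepsilon^2}{8 M L'} - \frac{1}{4} = \zeta .
\end{align*}
```

When $\zeta < 0$ every $\lambda_j \ge 0$ satisfies the inequality, so the probability is 1. When $\zeta \ge 0$ the event is $\lambda_j > \sqrt{\zeta}$, whose probability under the normalized measure $\mu / Z$ is $1 - \mu([0, \sqrt{\zeta}]) / Z$.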


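Numerical integration error  The final term of (C.1) also bears a Hoeffding bound. Define the
projection coefficient difference $\Delta^S_{\alpha,\lambda}(p, q) = a_\alpha(p^S_\lambda) - a_\alpha(q^S_\lambda)$, and $\hat\Delta$ similarly but with $\hat a$. Then

\[
\|A(\hat p) - A(\hat q)\|^2 - \|\hat A(\hat p) - \hat A(\hat q)\|^2
= \sum_{j=1}^{M} \sum_{S = R, I} \sum_{\alpha \in V} \left( \Delta^S_{\alpha,\lambda_j}(\hat p, \hat q)^2 - \hat\Delta^S_{\alpha,\lambda_j}(\hat p, \hat q)^2 \right) . \tag{C.2}
\]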


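Letting $\hat\epsilon(p) = \hat a_\alpha(p^S_\lambda) - a_\alpha(p^S_\lambda)$, each summand has absolute value at most
$\left( |\hat\epsilon(\hat p)| + |\hat\epsilon(\hat q)| \right)^2 + 2 \left| \Delta^S_{\alpha,\lambda}(\hat p, \hat q) \right| \left( |\hat\epsilon(\hat p)| + |\hat\epsilon(\hat q)| \right)$.
Also, $\left| \Delta^S_{\alpha,\lambda}(\hat p, \hat q) \right| \le 2 \sqrt{Z}$, using Cauchy-Schwarz on the integral and $\|\varphi_\alpha\|_2 = 1$. Thus each
summand in (C.2) can be more than $\varepsilon$ only if one of the $\hat\epsilon$s is more than $\sqrt{Z + \varepsilon/4} - \sqrt{Z}$.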
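
The threshold on the coefficient errors can be checked directly. Suppose both coefficient errors are at most $e \ge 0$ in absolute value (the symbol $e$ is introduced here only for this check), and use the bound of $2\sqrt{Z}$ on the coefficient difference:

```latex
\begin{align*}
(e + e)^2 + 2 \cdot 2\sqrt{Z} \cdot (e + e) = 4 e^2 + 8 \sqrt{Z}\, e \le \varepsilon
  &\iff e^2 + 2 \sqrt{Z}\, e - \frac{\varepsilon}{4} \le 0 \\
  &\iff e \le \sqrt{Z + \varepsilon / 4} - \sqrt{Z} .
\end{align*}
```

So if neither error exceeds $\sqrt{Z + \varepsilon/4} - \sqrt{Z}$, the summand is at most $\varepsilon$, which is the contrapositive of the claim.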

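Now, using (4.10), $\hat a_\alpha(\hat p^S_\lambda)$ is an empirical mean of $n_e$ independent terms, each with absolute
value bounded by $(\sqrt{\rho^*} + 1) \max_x |\varphi_\alpha(x)| = \sqrt{\rho^*} + 1$. Thus, using a Hoeffding bound on the $\hat\epsilon$s, we
get that $\Pr\left( \left| \|A(\hat p) - A(\hat q)\|^2 - \|\hat A(\hat p) - \hat A(\hat q)\|^2 \right| \ge \varepsilon^2 \right)$ is no more than

\[
8 M |V| \exp\left( - \frac{n_e \left( \sqrt{Z + \varepsilon^2 / (8 |V| M)} - \sqrt{Z} \right)^2}{2 Z \left( \sqrt{\rho^*} + 1 \right)^2} \right) .
\]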


Final bound  Combining the bounds for the decomposition (C.1) with the pointwise rate for
RKS features, we get:


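\begin{align*}
\Pr\left( \left| K(p, q) - z(\hat A(\hat p))^T z(\hat A(\hat q)) \right| \ge \varepsilon \right) \le {}
 & 2 \exp\left( -D \varepsilon_{RKS}^2 \right)
   + 2\, C^{-1}\left( \frac{\varepsilon_{KDE}^4\, n^{2\beta / (2\beta + d)}}{4 \log n} \right)
   + 2 \exp\left( -M \varepsilon_\lambda^4 / (8 Z^2) \right) \\
 & + \delta + 2M \left( 1 - \frac{1}{Z}\, \mu\left( \left[ 0, \sqrt{ \max\left( 0,\; \frac{\rho_*\, t\, \varepsilon_{tail}^2}{8 M d \hat L^2} \cdot \frac{4^{\hat\gamma} - 4^s}{4^{\hat\gamma}} - \frac14 \right) } \right] \right) \right) \\
 & + 8 M |V| \exp\left( -\frac12 n_e \left( \frac{ \sqrt{1 + \varepsilon_{int}^2 / (8 |V| M Z)} - 1 }{ \sqrt{\rho^*} + 1 } \right)^2 \right)
\end{align*}

for any $\varepsilon_{RKS} + \frac{1}{\sigma \sqrt{e}} \left( \varepsilon_{KDE} + \varepsilon_\lambda + \varepsilon_{tail} + \varepsilon_{int} \right) \le \varepsilon$.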